is subject to copyright protection. Specifically, source code instructions by which 
specific embodiments of the present invention are practiced in a computer system are 
included. The copyright owner has no objection to the facsimile reproduction of the 

FILTERS INCLUDING ADJUSTED RESULT"; and 

Serial No. 10/057,694, filed on January 23, 2002, entitled "METHODS 
FOR EFFICIENT FILTERING OF DIGITAL SIGNALS." 

Background Of The Invention 

Field of the Invention 

[03] This invention is related in general to computer processing and more specifically 
to the use of single instruction multiple data (SIMD) instructions to achieve finite 
impulse response filter operations in a digital processor. 

Description Of The Background Art 

[04] Finite Impulse Response (FIR) filter operations are an important type of digital 
computation or processing. FIR filters are commonly used, for example, in 
pre-processing, post-processing, motion compensation, and motion estimation for video 
compression standards. The implementation of FIR filters in computer programs, or 
other digital processing approaches, is useful in many other applications including audio 
processing, signal conditioning, simulation of electronic components, etc. 
[05] FIR filter operations can be very demanding on digital processing systems 
because of the large number of iterative operations that must be performed very quickly. 
The number of operations, speed of operation, resolution of coefficient values, and other 
factors all contribute to the accuracy of the implementation and the amount of processing 
resources that are necessary to achieve a design goal. In this respect, a slight advantage 
in FIR filter operations that are executed frequently (i.e., in an "inner loop" of a program) 
can result in very significant performance gains. 

[06] An FIR filter that is of special interest in video compression and encoding 
techniques is referred to as a transversal or tapped delay filter. These filters multiply 
pixel values of a video frame by a set of coefficients to generate a new pixel value. Such an 
operation is useful, for example, to compress an image by combining adjacent pixel 
values into a smaller number of pixel values. Typically, this type of FIR filter includes 
only positive coefficients. 

[07] Fig. 1 illustrates four pixel values a_1, a_2, a_3, and a_4. Subpixel h is desired to be 
the average of the four pixels computed as: 

[08] h = (a_1 + a_2 + a_3 + a_4 + 2) >> 2, (1) 

where >> is a bitwise right shift operator. 
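As a minimal sketch of equation (1), the following C function computes the rounded four-pixel average for a single subpixel. The function name `subpixel_avg4` is an assumption made for illustration and does not appear in the original text.

```c
#include <stdint.h>

/* Rounded average of four 8-bit pixels, per equation (1):
   h = (a1 + a2 + a3 + a4 + 2) >> 2.
   The sum is formed in a wider type so that it cannot overflow 8 bits. */
uint8_t subpixel_avg4(uint8_t a1, uint8_t a2, uint8_t a3, uint8_t a4)
{
    unsigned sum = (unsigned)a1 + a2 + a3 + a4 + 2u;
    return (uint8_t)(sum >> 2);
}
```

The added constant 2 is half of the divisor 4, so the shift rounds to the nearest integer instead of truncating.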

[09] In a typical application where pixel values are limited to values in the range 
0-255, the pixels a_1,...,a_4 are each represented in one byte or 8 bits. Thus, a total of four 
bytes is necessary to operate on the four pixel values at once. 

[10] Typically, a single frame in a digital video presentation of moderate resolution 
can include 600x800 = 480,000 pixels. Such a frame might be displayed 30 times per 
second. Moreover, it may be necessary to perform additional "passes" over the frame so 
that, for example, in subsequent passes the pixels, themselves, are combined into 
subpixels to further compress an image. Thus, numerous subpixel computations may be 
necessary. Further digital video formats, such as high-definition television, use much 
higher screen resolutions and color depths. It should be apparent that such filter 
operations could place enormous requirements on processing resources, especially when 
the operations must be performed in real time. 

[11] One approach that the prior art uses to provide increased efficiency in filter or 
array operations is to use Single Instruction Multiple Data (SIMD) instructions. Such 
instructions allow value-packing, byte-packing, or other concatenating of values into a 
single word or other unit of data. The unit of data can be processed quickly by 
performing a desired operation in parallel on the packed values. 

[12] SIMD-type instructions are available in many processors. Examples include Intel 
Multi-Media Extensions (MMX)™ and Streaming SIMD Extension (SSE)™, as well as 
NEC VR5432, Equator MAP-CA™, and Philips TM-1300 processors. In processors 
whose architecture supports SIMD instructions there are typically multiple identical 
processors, N, each with its own local memory where it can store data. All processors 
work under the control of a single instruction stream issued by a central control unit. 
There are typically N data streams, one per processor. The processors operate 
synchronously: at each step, all processors execute the same instruction on a different 
data element. This architecture allows computations in parallel. Thus, if N=8, it is 
possible to achieve a computational speedup of 8. 

[13] Fig. 2 provides an example of the operation of a SIMD instruction. The SIMD 
instruction performs an operation, "OP," on two sets of data: A=[a_1,...,a_8], a vector of 8 
data values, each of which is an unsigned 8-bit integer, i.e., a_i ∈ [0,255]; and B=[b_1,...,b_8], 
another vector of unsigned integers within the range [0,255]. The final result C=[c_1,...,c_8] 
is achieved by simultaneously operating on all 8 values of A and B as c_i = a_i OP b_i, for 
i=1,...,8. In this example, A, B and C are 64-bit registers in which all 8 values of a_i, b_i, 
and c_i are packed as contiguous bytes as shown in Figure 2, i.e., N=8. Such operations are 
also known as packed operations, since 8 values of data are packed in a single register A, 
B or C. 

[14] One specific type of operation of interest in filter operations is the PAVG 
operation that can be found, e.g., in the Intel MMX™ instruction set. The PAVG 
instruction performs the following computation: 
[15] PAVG(A,B) = [(a_i + b_i + 1) >> 1, i=1,...,8]. (2) 

[16] This operation takes 8-bit values of a_i, b_i, and stores the intermediate sum 
(a_i + b_i + 1) in 9 bits before doing a bitwise logical right shift operation to get the final result. 
It is available in many processors, including the ones mentioned above, and uses only one 
instruction. This instruction has a latency of 1 clock cycle in the Intel Pentium III, and 2 
clock cycles in the Intel Pentium 4 and Advanced Micro Devices' (AMD's) Athlon, with a 
throughput of 1 clock cycle. The same performance is realized for other operations in 
these architectures, such as packed addition (+), subtraction (-), bitwise AND (&), bitwise 
OR (|), bitwise EXCLUSIVE-OR (^), bitwise right shift (>>), and bitwise left shift (<<) 
operations. 
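The behavior of equation (2) can be modeled in portable C. The sketch below is an emulation only, not the hardware instruction: it loops over the eight byte lanes of a 64-bit word, and the names `pavg_u8` and `pavg_packed` are assumptions chosen for illustration.

```c
#include <stdint.h>

/* Rounded-up average of two bytes, as in equation (2).
   The intermediate sum needs 9 bits, so it is held in a wider type. */
static uint8_t pavg_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + b + 1u) >> 1);
}

/* Lane-wise emulation on eight bytes packed into one 64-bit word. */
uint64_t pavg_packed(uint64_t a, uint64_t b)
{
    uint64_t c = 0;
    for (int i = 0; i < 8; ++i) {
        uint8_t ai = (uint8_t)(a >> (8 * i));
        uint8_t bi = (uint8_t)(b >> (8 * i));
        c |= (uint64_t)pavg_u8(ai, bi) << (8 * i);
    }
    return c;
}
```

On real hardware the whole loop corresponds to a single packed instruction; the loop is only there to make the per-lane arithmetic explicit.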

[17] Although SIMD instructions can improve the efficiency and speed of 
computations, such instructions are sometimes difficult to use effectively when the SIMD 
instructions do not provide the exact type of operation needed. For example, as stated 
above, PAVG computes (a_i + b_i + 1) >> 1, an average of two vectors rounded up. 
However, it is more desirable in some filter operations to obtain (a_i + b_i) >> 1, which is a 
truncated average where the remainder, or fractional part, is discarded. Such a difference 
in operation is significant where multiple passes of frame data are made, as the average 
intensity value of subpixels may increase and result in artifacts or other objectionable 
qualities in the processed data. In the architectures discussed herein, a SIMD instruction 
to compute (a_i + b_i) >> 1 is not provided. Typically, a non-SIMD approach must be used. 
[18] A problem also arises when the number of arguments required by a SIMD 
operation is not the same as the number of variables in a formula to be implemented by 
the SIMD operation. For example, if a SIMD instruction accepts two arguments then it is 
"mismatched" to implement a formula, computation or operation with more than two 
variables or values. The same can be said, for example, for a SIMD instruction with 
three arguments used to implement a formula with other than three variables, etc. 
[19] Fig. 3 illustrates a non-SIMD approach to compute (a_i + b_i) >> 1. 
[20] In Fig. 3, a_i, b_i are unsigned integers within the range [0,255], i.e., each a_i, b_i is 
represented in 8 bits. The number of processors, N=8, i.e., the operation (a_i + b_i) >> 1 is 
simultaneously performed on 8 values of a_i and b_i for i=1,...,8. All 8 values of a_i 
(usually contiguous pixels) are packed in 64-bit register A, and 8 values of b_i in 64-bit 
register B. Since a_i + b_i can exceed 8 bits, the 8-bit (byte) values of a_i, b_i are unpacked into 
16-bits (words) as four 16-bit values per 64-bit register. Then the packed registers A and 
B are added together, followed by bitwise logical right shift by 1, followed by packing 
again. Note that in most processors, data can be packed into 64-bit registers as 8 (byte), 
16 (word), 32 (dword), or 64 (qword) bit values only. Figure 3 shows the conventional 
method of doing the packed operation c_i = (a_i + b_i) >> 1 for i=1,...,8. It is clear from Figure 
3 that, given sufficient memory, 9 instructions are needed to achieve the result 
c_i = (a_i + b_i) >> 1 for all 8 values of a_i and b_i. Each instruction in Figure 3 is represented by 
an ellipse. 
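The unpack, add, shift, and pack sequence described above can be sketched in portable C. This is only an illustrative emulation on arrays, not a reproduction of Fig. 3 or of actual packed instructions; the function name `avg_trunc_unpacked` and the array-based interface are assumptions.

```c
#include <stdint.h>

/* Truncated average c_i = (a_i + b_i) >> 1 for eight byte lanes,
   following the conventional unpack / add / shift / pack sequence.
   Each loop stands in for one or more packed instructions. */
void avg_trunc_unpacked(const uint8_t a[8], const uint8_t b[8], uint8_t c[8])
{
    uint16_t aw[8], bw[8], sw[8];

    /* Unpack: widen bytes to 16-bit words so the sum cannot overflow. */
    for (int i = 0; i < 8; ++i) {
        aw[i] = a[i];
        bw[i] = b[i];
    }
    /* Add, then logical right shift by 1, on the 16-bit words. */
    for (int i = 0; i < 8; ++i) {
        sw[i] = (uint16_t)((aw[i] + bw[i]) >> 1);
    }
    /* Pack: narrow the words back to bytes; the result always fits. */
    for (int i = 0; i < 8; ++i) {
        c[i] = (uint8_t)sw[i];
    }
}
```

The widening step is what costs the extra instructions: a 64-bit register holds only four 16-bit words, so every byte register has to be split into a low and a high half.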

Summary of Embodiments of the Invention 

[21] The invention provides a system for efficient derivation of finite impulse response 
(FIR) values. A single-instruction multiple data (SIMD) type of operation is used. In a 
preferred embodiment, the operation is achieved by an instruction called PAVG. The 
results of PAVG are a rounded-up average of two sets of packed values. Adjustments are 
made on the rounded-up average to obtain an exact desired result for various filter 
calculations. 
[22] The invention also provides approaches to achieving approximate desired results 
that differ from the exact desired results yet remain within acceptable error ranges. The 
approximate approaches require less computation and can be advantageous in different 
applications, or embodiments, of the invention. An adjusted approximate approach 
improves the accuracy of the approximate approach. Various techniques for minimizing 
processor resources (e.g., processing cycles, memory) are presented. 
[23] These provisions together with the various ancillary provisions and features which 
will become apparent to those artisans possessing skill in the art as the following 
description proceeds are attained by devices, assemblies, systems and methods of 
embodiments of the present invention, various embodiments thereof being shown with 
reference to the accompanying drawings, by way of example only, wherein: 
[24] One embodiment of the invention provides a method for obtaining an average of a 
plurality of values, wherein a first plurality of values is stored in a first packed structure, 
wherein a second plurality of values is stored in a second packed structure, the method 
comprising using an averaging operation on the values in the first and second packed 
structures to obtain a plurality of values in a packed average result, wherein a value in the 
packed average result equals a rounded-up average of a value in the first packed structure 
and a value in the second packed structure; determining whether the sum of the value in 
the first packed structure plus the value in the second packed structure is an odd number 
and, if so, performing the step of subtracting one from the value in the packed average 
result to obtain a packed adjusted result. 

[25] Another embodiment provides a method for adjusting the result of a PAVG 
instruction, wherein a first set, A, of packed values, a_i, and a second set, B, of packed 
values, b_i, are operated on by PAVG to obtain PAVG(A,B) = [(a_i + b_i + 1) >> 1, i=1,...,8], 
the method comprising adjusting the result of the PAVG operation to obtain a packed 
value result, C, as C = PAVG(A, B) - (A ^ B) & 0x01. 

[26] Another embodiment of the invention provides a method for achieving an 
averaged result on packed binary values A_1, A_2, A_3, A_4, the method using a PAVG 
instruction that computes a rounded-up average on first and second sets of packed values 
to produce a resulting set of packed averages, wherein B_1 = PAVG(A_1, A_2) and 
B_2 = PAVG(A_3, A_4), the method comprising deriving a result, R, as 

R = PAVG(B_1, B_2) - (B_1 ^ B_2) & ONE    when E = 0 
R = PAVG(B_1, B_2)                         when E = 1 

wherein ONE is a value with a one in the least significant bit position of one or more 
packed values (although typically a one value is used in the least significant bit position 
of every packed value) and wherein E = 1 when both (A_1 + A_2 + ONE) and (A_3 + A_4 + ONE) 
are odd integers. 

[27] Yet another embodiment of the invention provides a method for achieving an 
approximate averaged result on packed binary values A_1, A_2, A_3, A_4 for use in finite 
impulse response filter computations, the method using a PAVG instruction that computes 
a rounded-up average on first and second sets of packed values to produce a resulting set 
of packed averages, wherein B_1 = PAVG(A_1, A_2) and B_2 = PAVG(A_3, A_4), the method 
comprising deriving a result, R, as R = PAVG(B_1, B_2) - (B_1 ^ B_2) & ONE 
wherein ONE is a value with a one in the least significant bit position of one or more 
packed values. 

[28] Still another embodiment of the invention provides a method for achieving an 
averaged result on packed binary values for use in finite impulse response filter 
computations, the method using an instruction, PAVG, that computes a rounded-up 
average on first and second sets of packed values to produce a resulting set of packed 
averages, the method comprising detecting when the use of the PAVG instruction 
introduces a rounding-up increase in an averaged result; and decreasing the rounding-up 
increase to achieve a desired result. 



Brief Description of the Drawings 

[29] Fig. 1 illustrates a subpixel average of four pixel values; 

[30] Fig. 2 shows an example of the execution of a single-instruction multiple-data 
(SIMD) instruction; 

[31] Fig. 3 shows a non-SIMD approach to a calculation; and 
[32] Fig. 4 shows a SIMD implementation with adjustment. 

Detailed Description of Embodiments of the Invention 

[33] A preferred embodiment of the invention uses Intel's MMX/SSE architecture, 
including the SIMD PAVG operation. Other embodiments may use other processors, 
instructions and operations in a manner similar to that disclosed herein and realize similar 
computational benefits. In addition, other techniques and approaches for performing 
processing may benefit from one or more of the features presented herein, such as the 
techniques of the related patent application "METHODS FOR EFFICIENT FILTERING 
OF DIGITAL SIGNALS," cited above. 
[34] Table I shows notations used in this application. 

Operator    Description 
+           Addition 
-           Subtraction 
&           Bitwise AND 
|           Bitwise OR 
^           Bitwise exclusive OR 
>>          Bitwise logical right shift 
<<          Bitwise logical left shift 
~           Bitwise NOT 
CLIP(x)     Clips x to range [0,255] 
ODD(x)      Returns 1 when x is odd, 0 otherwise 
EVEN(x)     Returns 1 when x is even, 0 otherwise 

TABLE I 



[35] The present invention allows computing c_i = (a_i + b_i) >> 1 for i=1,...,8, in an efficient 
manner using a SIMD instruction such as PAVG. Note that simply using the PAVG 
instruction on packed values in registers A and B will not yield the correct answer. For 
example, when a_i + b_i is an odd number, PAVG(a_i, b_i) gives a result that is one more than 
the correct answer. The result of a PAVG operation must be adjusted as follows: 

C = PAVG(A, B) - (A ^ B) & 0x01, (3) 
where 0x01 is an 8-bit number whose least significant bit is 1 and the rest are 0's. Here, 
and in the expressions that follow, the & operation is applied before the subtraction, i.e., 
C = PAVG(A, B) - ((A ^ B) & 0x01). 
[36] The PAVG operation with adjustment is shown in Fig. 4. Assuming sufficient 
memory, only 4 instructions instead of the previous 9 instructions (without using PAVG) 
are needed to achieve the packed operation C = (A + B) >> 1. This is an approximate speedup 
of 9/4 = 2.25 times. 
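A minimal sketch of the adjusted operation, assuming eight byte lanes held in a 64-bit word, is given below. PAVG is emulated with a loop because portable C has no packed average; the helper names `pavg` and `avg_trunc` and the macro `ONE` (0x01 replicated in every lane) are illustrative choices.

```c
#include <stdint.h>

#define ONE 0x0101010101010101ULL  /* 0x01 in every byte lane */

/* Lane-wise rounded-up average, emulating PAVG on eight packed bytes. */
static uint64_t pavg(uint64_t a, uint64_t b)
{
    uint64_t c = 0;
    for (int i = 0; i < 8; ++i) {
        unsigned ai = (unsigned)((a >> (8 * i)) & 0xFFu);
        unsigned bi = (unsigned)((b >> (8 * i)) & 0xFFu);
        c |= (uint64_t)((ai + bi + 1u) >> 1) << (8 * i);
    }
    return c;
}

/* Truncated average C = PAVG(A,B) - ((A ^ B) & 0x01) in every lane.
   A lane of the correction is 1 only when a_i + b_i is odd, in which case
   the PAVG lane is at least 1, so a plain 64-bit subtraction never borrows
   across lanes. */
uint64_t avg_trunc(uint64_t a, uint64_t b)
{
    return pavg(a, b) - ((a ^ b) & ONE);
}
```

The explicit parentheses around `(a ^ b) & ONE` matter in C, where subtraction binds more tightly than `&`.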

[37] A preferred embodiment of the invention achieves the same computational result 
as in Fig. 4 with even fewer instructions by appropriately using the PAVG instruction in 
combination with supplemental logical operations to adjust for the rounded-up average. 
As described below, several FIR filtering operations can be modified to obtain results in 
fewer instructions when compared to conventional SIMD implementations. 
[38] Without loss of generality, let A_1, A_2, ..., A_16 be 16 vectors, each of which contains 
8 packed data elements. For example, A_5 contains 8 data elements A_5 = [a_(5,1), ..., a_(5,8)]. 
Each data element a_(1,i), ..., a_(16,i), i=1,...,8, is within the range [0,255], i.e., they are 
represented by bytes, and A_1,...,A_16 are packed 64-bit registers: 

A_j = [a_(j,1), ..., a_(j,8)] for j=1,...,16. (4) 

[39] We perform various operations on the packed 64-bit registers A_1,...,A_16 to obtain 
different FIR filters described below. We define packed 64-bit vectors/registers ONE and 
ONE_4 as follows: 

ONE = [0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01], 
ONE_4 = [0x0001, 0x0001, 0x0001, 0x0001], (5) 
where 0x01 is a byte containing 1 in its least significant bit and 0's elsewhere. The 
packed 64-bit register ONE contains 8 packed bytes, each containing 0x01. On the other 
hand, the packed 64-bit register ONE_4 contains 4 packed words (16 bits), each containing 
0x0001. 

[40] The FIR filters used in a preferred embodiment include: 

1. Type 1 Filter: (A_1 + A_2 + c*ONE) >> 1, where c ∈ {-2,-1,0,1,2}, (6) 

2. Type 2 Filter: (A_1 + A_2 + A_3 + A_4 + c*ONE) >> 2, where c ∈ {0,1,2}, (7) 

3. Type 3 Filter: (A_1 + A_2 + A_3 + A_4 + A_5 + A_6 + A_7 + A_8 + c*ONE) >> 3, where c ∈ {0,1,2,3,4}, (8) 

4. Type 4 Filter: (A_1 + A_2 + ... + A_15 + A_16 + c*ONE) >> 4, where c ∈ {0,1,2,3,4,5,6,7,8}. (9) 



[41] All 4 types of FIR filters are useful for video compression applications. There are 
numerous FIR filters that can be constructed from these 4 basic types, in addition to those 
described herein. For example, the filter (2A_1 + A_2 + A_3 + 2*ONE) >> 2 is a Type 2 filter 
with A_1 = A_4. Similarly, the filter (A_1 + 2A_2 + 2A_3 + 2A_4 + A_5 + 4*ONE) >> 3 is a Type 3 filter 
with A_2 = A_6, A_3 = A_7, and A_4 = A_8. Many other types of filters can be constructed as will be 
apparent to one of skill in the art. 

[42] Instructions according to the present invention can be used to obtain exact filter 
computations. Such exactness may be necessary as, e.g., in motion compensation and 
estimation applications where accuracy is key. In other cases an approximation of the 
filter computation may be sufficient. For example, in cases where the number of 
operations is large an approximate computation can be a better tradeoff. The 
approximations of the preferred embodiments produce an error of ±1 in the final result 
for a small percentage of all values of a_(j,i) ∈ [0,255] for i=1,...,8, and j=1,...,16. These 
results are useful in cases such as post processing, where a small error of ±1 (in intensity 
or color value) is inconsequential in the final result. Naturally, other approximations of 
different degrees of accuracy are possible and are within the scope of the invention. 

I. Type 1 FIR Filters 

[43] There are 5 variations of the Type 1 FIR filters (A_1 + A_2 + c*ONE) >> 1, where 
c ∈ {-2,-1,0,1,2}, based on the 5 choices of constant c. We state the SIMD 
implementation for each of these filters: 

(A_1 + A_2 - 2*ONE) >> 1 = PAVG(A_1,A_2) - ONE - (A_1 ^ A_2) & ONE, (10) 

(A_1 + A_2 - ONE) >> 1 = CLIP(PAVG(A_1,A_2) - ONE), (11) 

(A_1 + A_2) >> 1 = PAVG(A_1,A_2) - (A_1 ^ A_2) & ONE, (12) 

(A_1 + A_2 + ONE) >> 1 = PAVG(A_1,A_2), (13) 

(A_1 + A_2 + 2*ONE) >> 1 = PAVG(A_1,A_2) + (~(A_1 ^ A_2) & ONE). (14) 



[44] There is a less efficient solution for (A_1 + A_2) >> 1 that will be used to simplify 
expressions: 
(A_1 + A_2) >> 1 = (A_1 >> 1) + (A_2 >> 1) + (A_1 & A_2 & ONE). (15) 
[45] Although (15) uses more instructions than (12), we need this expression to 
evaluate other filters. In (15), (A_1 & A_2 & ONE) is a correction term that is necessary 
when both A_1 and A_2 contain odd integers. An approximate solution for (A_1 + A_2) >> 1 is: 
(A_1 + A_2) >> 1 ≈ PAVG(CLIP(A_1 - ONE), A_2) or PAVG(A_1, CLIP(A_2 - ONE)). (16) 
[46] In most processors, subtract and CLIP() can be realized in one instruction. So the 
implementations in (16) require only 2 instructions. 
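The Type 1 identities can be checked with a single-lane model in C. The sketch below works on plain integers in [0,255] and ignores 8-bit wraparound (for example, (14) yields 256 when both inputs are 255); all function names are hypothetical and only (12), (14), (15), and (16) are modeled.

```c
#include <stdint.h>

/* Single-lane model of PAVG: rounded-up average of two bytes. */
static int pavg(int a, int b) { return (a + b + 1) >> 1; }

/* CLIP(x): clamp to [0,255]. */
static int clip(int x) { return x < 0 ? 0 : (x > 255 ? 255 : x); }

/* (12): (a + b) >> 1, exact. */
int type1_c0(int a, int b) { return pavg(a, b) - ((a ^ b) & 1); }

/* (14): (a + b + 2) >> 1, exact. */
int type1_c2(int a, int b) { return pavg(a, b) + (~(a ^ b) & 1); }

/* (15): (a + b) >> 1 without PAVG, exact. */
int type1_c0_shift(int a, int b) { return (a >> 1) + (b >> 1) + (a & b & 1); }

/* (16): approximation of (a + b) >> 1 using a saturating subtract. */
int type1_c0_approx(int a, int b) { return pavg(clip(a - 1), b); }
```

In this model the approximation (16) differs from the exact value only when the clamped operand is 0 and the other operand is odd, because that is the only case in which CLIP changes the subtraction.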

II. Type 2 FIR Filters 

[47] There are 3 variations of Type 2 filters (7) based on the 3 choices of constant c, 
where c ∈ {0,1,2}. We show the derivation of each filter. We define the following 64-bit 
packed registers, each containing 8 data elements of one byte each: 

B_1 = PAVG(A_1,A_2), B_2 = PAVG(A_3,A_4), EB_1 = (A_1 ^ A_2), EB_2 = (A_3 ^ A_4). (17) 

A. Type 2, Filter 1: R = (A_1 + A_2 + A_3 + A_4 + 2*ONE) >> 2 
i. Conventional SIMD Solution 

[48] This filter can be implemented in SIMD architecture (assuming sufficient 
memory) by using the following steps: 

1. (4 Instructions) A_iL = Unpack Low 4 Bytes of A_i, for i = {1,2,3,4}, 

2. (4 Instructions) A_iH = Unpack High 4 Bytes of A_i, for i = {1,2,3,4}, 

3. (5 Instructions) Add and Shift lower 4 words of A_1,...,A_4 to obtain lower 4 words of 
R_L as: 

R_L = (A_1L + A_2L + A_3L + A_4L + 2*ONE_4) >> 2, 

4. (5 Instructions) Add and Shift higher 4 words of A_1,...,A_4 to obtain higher 4 words of 
R_H as: 

R_H = (A_1H + A_2H + A_3H + A_4H + 2*ONE_4) >> 2, 

5. (1 Instruction) Pack R_L and R_H into final register R. 

We require 19 instructions to perform this filter by conventional SIMD methods. 
ii. Efficient SIMD Solution 

[49] In order to implement this filter efficiently, we simplify as follows: 

R = (((A_1 + A_2 + ONE) >> 1) + ((A_3 + A_4 + ONE) >> 1) + E) >> 1 = (B_1 + B_2 + E) >> 1, (18) 
where E is the correction term that is necessary when both (A_1 + A_2 + ONE) and 
(A_3 + A_4 + ONE) are odd integers as in (15). Detection of odd or even integers is performed 
with the functions ODD() and EVEN(), where ODD() returns "1" for each packed 
argument value only if the packed argument value is an odd number and returns "0" 
otherwise, and where EVEN() returns "1" for each packed argument value only if the 
packed argument value is an even number and returns "0" otherwise. 

E = ODD(A_1 + A_2 + ONE) & ODD(A_3 + A_4 + ONE) = EVEN(A_1 + A_2) & EVEN(A_3 + A_4) 
  = ~(A_1 ^ A_2) & ~(A_3 ^ A_4) & ONE = ~(EB_1 | EB_2) & ONE. (19) 
We note that E ∈ {0,1}. From (18) and (12), we have: 

R = PAVG(B_1,B_2) - (B_1 ^ B_2) & ONE    when E = 0 
R = PAVG(B_1,B_2)                        when E = 1 (20) 

We simplify (20) as: 

R = PAVG(B_1,B_2) - (B_1 ^ B_2) & ~E & ONE, 

which is the same as: 

R = PAVG(B_1,B_2) - (B_1 ^ B_2) & ((A_1 ^ A_2) | (A_3 ^ A_4)) & ONE. (21) 
The solution in (21) requires 10 instructions. We have an approximate 19:10 (approx. 
2:1) speedup by using (21). 
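Equation (21) can be exercised with a single-lane C model. The following is a sketch under the assumption that each lane behaves like an ordinary integer in [0,255]; the names `pavg` and `type2_filter1` are illustrative and not taken from the original.

```c
#include <stdint.h>

/* Single-lane model of PAVG: rounded-up average of two bytes. */
static int pavg(int a, int b) { return (a + b + 1) >> 1; }

/* Exact (a1 + a2 + a3 + a4 + 2) >> 2 for one byte lane, following (21). */
int type2_filter1(int a1, int a2, int a3, int a4)
{
    int b1 = pavg(a1, a2);
    int b2 = pavg(a3, a4);
    int eb1 = a1 ^ a2;            /* LSB set when a1 + a2 is odd */
    int eb2 = a3 ^ a4;            /* LSB set when a3 + a4 is odd */
    /* Subtract the rounding of the outer PAVG unless both inner sums
       were even (E = 1), in which case the outer rounding is wanted. */
    return pavg(b1, b2) - ((b1 ^ b2) & (eb1 | eb2) & 1);
}
```

When either inner sum is odd, the corresponding inner PAVG has already rounded up by one half, so the outer average must truncate; when both are even, no rounding has happened yet and the outer PAVG supplies it.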

iii. Approximate SIMD Solution 

[50] Besides the accurate solution, we can obtain an approximate solution in fewer 
instructions by assuming the least significant bit of EB_1 or EB_2 as 0 or 1. Assuming the 
least significant bit of EB_1 or EB_2 = 1, we get: 

R = PAVG(B_1,B_2) - (B_1 ^ B_2) & ONE. (22) 
This solution requires 6 instructions, and according to (16), it is close to the following: 

R = PAVG(CLIP(B_1 - ONE), B_2) or R = PAVG(B_1, CLIP(B_2 - ONE)). (23) 
This solution requires only 4 instructions, and produces a maximum error of ±1 in the 
final result for 12.5% of all possible values of A_1,...,A_4 between [0,255]. The error never 
exceeds ±1. We get a computational efficiency of 19:4, nearly 5 times speedup. 
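The two approximations (22) and (23) can be compared against the exact filter in a single-lane C model. This is an illustrative sketch with hypothetical function names; it models only the first form of (23) and does not reproduce any error statistics.

```c
#include <stdint.h>

/* Single-lane model of PAVG: rounded-up average of two bytes. */
static int pavg(int a, int b) { return (a + b + 1) >> 1; }

/* CLIP(x): clamp to [0,255]. */
static int clip(int x) { return x < 0 ? 0 : (x > 255 ? 255 : x); }

/* Approximation (22): always remove the outer rounding. */
int type2_filter1_approx22(int a1, int a2, int a3, int a4)
{
    int b1 = pavg(a1, a2);
    int b2 = pavg(a3, a4);
    return pavg(b1, b2) - ((b1 ^ b2) & 1);
}

/* Approximation (23), first form: saturating subtract, then PAVG. */
int type2_filter1_approx23(int a1, int a2, int a3, int a4)
{
    int b1 = pavg(a1, a2);
    int b2 = pavg(a3, a4);
    return pavg(clip(b1 - 1), b2);
}
```

Both functions return either the exact value of (a1 + a2 + a3 + a4 + 2) >> 2 or a value that differs from it by one.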



B. Type 2, Filter 2: R = (A_1 + A_2 + A_3 + A_4 + ONE) >> 2 

i. Efficient SIMD Solution 

[51] As seen in Section 3.A, this filter can be implemented by conventional SIMD 
methods in 19 instructions. For efficient implementation, we simplify as follows: 

R = (((A_1 + A_2 + ONE) >> 1) + ((A_3 + A_4) >> 1) + E) >> 1 = (B_1 + B_2 + (E - (EB_2 & ONE))) >> 1. (24) 

Here EB_2 is due to the correction term in (12), and E is the correction term in (15) as: 
E = ODD(A_1 + A_2 + ONE) & ODD(A_3 + A_4) = EVEN(A_1 + A_2) & ODD(A_3 + A_4) 
  = ~(A_1 ^ A_2) & (A_3 ^ A_4) & ONE = (~EB_1 & EB_2) & ONE. (25) 
We note that E_T = (E - (EB_2 & ONE)) ∈ {0,-1}. From (24), (11), and (12) we obtain: 

R = PAVG(B_1,B_2) - (B_1 ^ B_2) & ONE    when E_T = 0 
R = PAVG(B_1,B_2) - ONE                  when E_T = -1 (26) 

Note that (E - (EB_2 & ONE)) = -1 when (A_1 ^ A_2) & (A_3 ^ A_4) & ONE = 1. We simplify (26) 
as: 

R = PAVG(B_1,B_2) - ((B_1 ^ B_2) | ((A_1 ^ A_2) & (A_3 ^ A_4))) & ONE. (27) 
The solution in (27) requires 10 instructions, an approximate 19:10 (nearly 2 times) 
speedup. 

ii. Approximate SIMD Solution 

[52] We can obtain four approximations of (27) by assuming the least significant bit of 
EB_1 or EB_2 as 0 or 1. A good approximate solution is with the assumption that the least 
significant bit of EB_2 = 0, which gives us the same solutions as (22) and (23), which 
require 4 instructions and have a maximum error of ±1 for 12.5% of all possible values of 
A_1,...,A_4 ∈ [0,255]. We get a computational advantage of 19:4. 
C. Type 2, Filter 3: R = (A_1 + A_2 + A_3 + A_4) >> 2 

i. Efficient SIMD Solution 

[53] The filter can be implemented by conventional SIMD methods in 17 instructions. 
For efficient implementation, we simplify as follows: 

R = (((A_1 + A_2) >> 1) + ((A_3 + A_4) >> 1) + E) >> 1 = (B_1 + B_2 + (E - (EB_1 + EB_2) & ONE)) >> 1. (28) 

Here EB_1 and EB_2 are due to the correction term in (12), and E is the correction term in 
(15) as: 

E = ODD(A_1 + A_2) & ODD(A_3 + A_4) = (A_1 ^ A_2) & (A_3 ^ A_4) & ONE = EB_1 & EB_2 & ONE. (29) 

We note that E_T = (E - (EB_1 + EB_2) & ONE) ∈ {0,-1}, and R is the same as (26). Note that 
E_T = -1 when ((A_1 ^ A_2) | (A_3 ^ A_4)) & ONE = 1. We simplify (26) as: 

R = PAVG(B_1,B_2) - ((B_1 ^ B_2) | (A_1 ^ A_2) | (A_3 ^ A_4)) & ONE. (30) 
The solution in (30) requires 10 instructions. We have an approximate 17:10 speedup by 
using (30). 
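The exact solutions (27) and (30) for Filters 2 and 3 differ from (21) only in the mask that decides when the outer rounding is removed. A single-lane C sketch, with hypothetical function names and ordinary integers standing in for byte lanes, makes the difference visible:

```c
#include <stdint.h>

/* Single-lane model of PAVG: rounded-up average of two bytes. */
static int pavg(int a, int b) { return (a + b + 1) >> 1; }

/* Exact (a1 + a2 + a3 + a4 + 1) >> 2, following (27). */
int type2_filter2(int a1, int a2, int a3, int a4)
{
    int b1 = pavg(a1, a2), b2 = pavg(a3, a4);
    int eb1 = a1 ^ a2, eb2 = a3 ^ a4;
    /* Subtract 1 when both inner sums are odd; otherwise truncate. */
    return pavg(b1, b2) - (((b1 ^ b2) | (eb1 & eb2)) & 1);
}

/* Exact (a1 + a2 + a3 + a4) >> 2, following (30). */
int type2_filter3(int a1, int a2, int a3, int a4)
{
    int b1 = pavg(a1, a2), b2 = pavg(a3, a4);
    int eb1 = a1 ^ a2, eb2 = a3 ^ a4;
    /* Subtract 1 when either inner sum is odd; otherwise truncate. */
    return pavg(b1, b2) - (((b1 ^ b2) | eb1 | eb2) & 1);
}
```

In both cases the subtracted bit is 1 only when the outer PAVG result is at least 1, so the result never goes negative and CLIP is not needed for the exact forms.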

ii. Approximate SIMD Solution 

[54] We can obtain four approximations of (30) by assuming the least significant bit of 
EB_1 or EB_2 as 0 or 1. A good approximate solution is with the assumption that the least 
significant bits of EB_1 or EB_2 = 1, which gives us: 

R = CLIP(PAVG(B_1,B_2) - ONE). (31) 

This solution requires 4 instructions and has a maximum error of ±1 for 12.5% of all 
possible values of A_1,...,A_4 ∈ [0,255]. We have a computational advantage of 17:4, approx. 
4 times. 

D. Type 2, Special Filter 1: R = (2A_1 + A_3 + A_4 + 2*ONE) >> 2 

[55] This filter is the same as Filter 1 with A_1 = A_2. It can be implemented by conventional 
SIMD methods in, e.g., 17 instructions. This type of filter is used extensively in, for 
example, standards proposed by the Joint Video Team (JVT) as, for example, in 
ISO/IEC MPEG and ITU-T VCEG, Geneva, Switzerland, Oct., 02, 
entitled "Editor's Proposed Draft Text Modifications for Joint Video Specification 
(ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC), Geneva modifications, draft 26," and other 
coding schemes. 
[56] We can simplify (21) as: 

R = PAVG(A_1,B_2) - (A_1 ^ B_2) & (A_3 ^ A_4) & ONE. (32) 
This solution requires 7 instructions. One can verify that (32) is close to the following: 

R = PAVG(A_1, PAVG(CLIP(A_3 - ONE), A_4)), (33) 
which requires only 3 instructions instead of 17 instructions by conventional SIMD 
methods, a nearly 6 times speedup. However, (33) produces an error of ±1 for a very 
small 0.1% of all possible values of A_1,...,A_4 ∈ [0,255]. 
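A single-lane C sketch of (32) and (33) is shown below. The function names are hypothetical, and integers in [0,255] stand in for byte lanes.

```c
#include <stdint.h>

/* Single-lane model of PAVG: rounded-up average of two bytes. */
static int pavg(int a, int b) { return (a + b + 1) >> 1; }

/* CLIP(x): clamp to [0,255]. */
static int clip(int x) { return x < 0 ? 0 : (x > 255 ? 255 : x); }

/* Exact (2*a1 + a3 + a4 + 2) >> 2, following (32). */
int special_filter1(int a1, int a3, int a4)
{
    int b2 = pavg(a3, a4);
    return pavg(a1, b2) - ((a1 ^ b2) & (a3 ^ a4) & 1);
}

/* Approximation (33): two PAVGs and one saturating subtract. */
int special_filter1_approx(int a1, int a3, int a4)
{
    return pavg(a1, pavg(clip(a3 - 1), a4));
}
```

In this model the approximation can deviate from the exact value only when the saturating subtract actually clamps, that is, when a3 is 0, which is why the affected fraction of inputs is so small.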

E. Type 2, Summary of Results 

[57] Table II below summarizes the instructions required to compute each filter (given 
sufficient memory) by the efficient and conventional SIMD methods. For the Efficient 
case, we give the instructions required for the exact and approximate solutions. 



Summary of Results for Type 2 FIR Filters. 

                                       Conventional   Efficient Method     Speedup 
Type 2 Filters                         Method         Exact     Approx.    Exact     Approx. 
(A_1 + A_2 + A_3 + A_4 + 2*ONE) >> 2   19             10        4          1.9       4.75 
(A_1 + A_2 + A_3 + A_4 + ONE) >> 2     19             10        4          1.9       4.75 
(A_1 + A_2 + A_3 + A_4) >> 2           17             10        4          1.7       4.25 
(2A_1 + A_3 + A_4 + 2*ONE) >> 2        17             7         3          2.4       5.67 

TABLE II 

[58] The shaded areas show significant improvements in efficiency due to the analyses 
developed here. 

3. Type 3 FIR Filters 

[59] There are 5 different Type 3 FIR filters depending on the 5 choices of c in (8). We 
define the following packed 64-bit registers, each containing 8 data elements of one byte 
each: 

B_1 = PAVG(A_1,A_2), B_2 = PAVG(A_3,A_4), B_3 = PAVG(A_5,A_6), B_4 = PAVG(A_7,A_8), 
C_1 = PAVG(B_1,B_2), C_2 = PAVG(B_3,B_4), 
EB_1 = (A_1 ^ A_2), EB_2 = (A_3 ^ A_4), EB_3 = (A_5 ^ A_6), EB_4 = (A_7 ^ A_8), 
EC_1 = (B_1 ^ B_2), EC_2 = (B_3 ^ B_4). (34) 

A. Type 3, Filter 1: R = (A_1 + A_2 + A_3 + A_4 + A_5 + A_6 + A_7 + A_8 + 4*ONE) >> 3 

i. Conventional SIMD Solution 

[60] This filter can be implemented in SIMD architecture (assuming sufficient 
memory) by using the following steps: 

1. (8 Instructions) A_iL = Unpack Low 4 Bytes of A_i, for i = 1,...,8, 

2. (8 Instructions) A_iH = Unpack High 4 Bytes of A_i, for i = 1,...,8, 

3. (9 Instructions) Add and Shift lower 4 words of A_1,...,A_8 to obtain lower 4 words of 
R_L as: 

R_L = (A_1L + A_2L + A_3L + A_4L + A_5L + A_6L + A_7L + A_8L + 4*ONE_4) >> 3, 

4. (9 Instructions) Add and Shift higher 4 words of A_1,...,A_8 to obtain higher 4 words of 
R_H as: 

R_H = (A_1H + A_2H + A_3H + A_4H + A_5H + A_6H + A_7H + A_8H + 4*ONE_4) >> 3, 

5. (1 Instruction) Pack R_L and R_H into final register R. 

We require 35 instructions to compute this filter by conventional SIMD methods. 

ii. New SIMD Solution 

[61] In order to implement this filter without unpacking, we simplify it as follows: 

R = (((A_1 + A_2 + A_3 + A_4 + 2*ONE) >> 2) + 
     ((A_5 + A_6 + A_7 + A_8 + 2*ONE) >> 2) + E) >> 1, (35) 
where E is the error/correction term that is necessary for dividing up the expression into 
two parts each containing a Type 2, Filter 1 (21). Note that according to (21) and the 
registers in (34), we have: 

(A_1 + A_2 + A_3 + A_4 + 2*ONE) >> 2 = C_1 - (E_1 & ONE), 
(A_5 + A_6 + A_7 + A_8 + 2*ONE) >> 2 = C_2 - (E_2 & ONE), (36) 
where E_1 = EC_1 & (EB_1 | EB_2) and E_2 = EC_2 & (EB_3 | EB_4) are error/correction terms 
obtained in (21). We can simplify (35) as: 

R = (C_1 + C_2 + (E - E_1 - E_2) & ONE) >> 1. (37) 
We now find the expression for E in terms of the packed 64-bit registers in (34). The 
simplification in (35) amounts to the following: 

(P + Q) >> 3 = ((P >> 2) + (Q >> 2) + E) >> 1, (38) 
where P and Q are unsigned integers. Let p_0 be the least significant bit of P and p_1 the 
next significant bit of P. Similarly, let q_0 be the least significant bit of Q and q_1 the next 
significant bit of Q. The simplification in (38) results in an error E when the last 2 bits of 
P and Q add up to a number ≥ 4. The condition that determines this error E is: 

(p_1 & q_1) | (p_0 & q_0 & (p_1 | q_1)). 
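The carry condition used with (38) can be checked numerically. The C sketch below is illustrative only; the function names `carry_bit` and `split_shift3` are assumptions. It computes the correction bit from the two low bits of P and Q, which is 1 exactly when those two-bit remainders sum to 4 or more.

```c
#include <stdint.h>

/* Correction bit E for splitting (P + Q) >> 3 as in (38): E is 1 exactly
   when the two low bits of P and the two low bits of Q sum to 4 or more. */
unsigned carry_bit(unsigned p, unsigned q)
{
    unsigned p0 = p & 1u, p1 = (p >> 1) & 1u;
    unsigned q0 = q & 1u, q1 = (q >> 1) & 1u;
    return (p1 & q1) | (p0 & q0 & (p1 | q1));
}

/* Right-hand side of (38). */
unsigned split_shift3(unsigned p, unsigned q)
{
    return ((p >> 2) + (q >> 2) + carry_bit(p, q)) >> 1;
}
```

The expression is the carry out of a two-bit adder: either both high bits are set, or both low bits are set and at least one high bit propagates the carry.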
We can prove that p_1, p_0, q_1, q_0 can be expressed in terms of the registers in (34) as the 
least significant bits of the following packed 64-bit registers respectively: 

P_1 = (EC_1 ^ ~(EB_1 | EB_2)), 
P_0 = (EB_1 ^ EB_2), 
Q_1 = (EC_2 ^ ~(EB_3 | EB_4)), 
Q_0 = (EB_3 ^ EB_4). (39) 
From (39), we can express E as the least significant bit of: 

E = (P_1 & Q_1) | (P_0 & Q_0 & (P_1 | Q_1)). (40) 
We note that E ∈ {0,1}, and E_T = (E - E_1 - E_2) ∈ {-1,0,1}. From (37), we have: 

R = PAVG(C_1,C_2) - ONE                  when E_T = -1 
R = PAVG(C_1,C_2) - (C_1 ^ C_2) & ONE    when E_T = 0 (41) 
R = PAVG(C_1,C_2)                        when E_T = 1 



We simplify (41) as: 
R = PAVG(C_1,C_2) - ((C_1 ^ C_2) | (E_1 ^ E_2 ^ E)) & (E_1 | E_2 | ~E) & ONE, (42) 
where E_1 = EC_1 & (EB_1 | EB_2) and E_2 = EC_2 & (EB_3 | EB_4). We can further simplify (42) 
as an expression in terms of EC_1, EC_2, EB_1, EB_2, EB_3, and EB_4 so that we can skip the 
computations of E_1, E_2, and E as follows: 

U = EC_1 | EC_2, 
V = EB_1 | EB_2, 
W = EB_3 | EB_4, 
X = V | W, 
Y = U | X, 
Z = (EC_1 & EC_2 & X), 
T = U & V & W & ((EB_1 & EB_2) | (EB_3 & EB_4)), 
R = PAVG(C_1,C_2) - ((C_1 ^ C_2) | Z | T) & Y & ONE. (43) 



[62] The solution in (43) is shown in pseudo-code in Table III, below. Any suitable 
language, coding technique, circuitry or combination of hardware and software can be 
used to achieve the functionality shown in the pseudo-code presented herein. The 
approach of Table III uses 32 instructions as compared to the conventional 35 
instructions. The count of 32 instructions is obtained by counting each logical and 
arithmetic operation of (43) along with those of (34). Other instruction counts in this
application are obtained similarly. Clearly, the approach of Table III is not as efficient
as the Type 2 algorithms. However, there are at least 2 benefits of this approach:

(1) We can systematically arrive at approximate solutions by making assumptions
on the error/correction terms EC1, EC2, EB1, EB2, EB3, and EB4 (see Section
3.A.iii).

(2) In special cases, where several of the Ai's are the same, we can simplify the computation
considerably and obtain efficient exact and approximate solutions (see Sections
3.F-3.H).
#define P(a,b) (((a) + (b) + 1) >> 1)    // per-byte rounding average (PAVG)

b1 = P(a01,a02);
b2 = P(a03,a04);
b3 = P(a05,a06);
b4 = P(a07,a08);
c1 = P(b1 ,b2 );
c2 = P(b3 ,b4 );
d  = P(c1 ,c2 );
eb1 = a01 ^ a02;                         // error bits of the first averaging stage
eb2 = a03 ^ a04;
eb3 = a05 ^ a06;
eb4 = a07 ^ a08;
ec1 = b1 ^ b2;                           // error bits of the second stage
ec2 = b3 ^ b4;
ed  = c1 ^ c2;                           // error bit of the final stage
u = ec1 | ec2;
v = eb1 | eb2;
w = eb3 | eb4;
x = v | w;
y = u | x;

z = ec1 & ec2 & x;

t = u & v & w & ((eb1 & eb2) | (eb3 & eb4));
e = ((ed & y) | z | t) & 0x01; // Exact solution
x1 = CLIP(d-e);

TABLE III
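The exactness of the Table III correction can be exercised per byte lane against a full-precision reference. The C fragment below is a minimal sketch under stated assumptions: it models one byte lane with ordinary unsigned integers rather than packed 64-bit registers, and the names table3_exact, filter1_reference, next_byte and check_table3 are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Per-byte behaviour assumed for PAVG: rounding average. */
static unsigned pavg(unsigned a, unsigned b) { return (a + b + 1u) >> 1; }

/* Table III for one byte lane: (A1+...+A8+4)>>3 from averages and XORs. */
static unsigned table3_exact(const uint8_t a[8]) {
    unsigned b1 = pavg(a[0], a[1]), b2 = pavg(a[2], a[3]);
    unsigned b3 = pavg(a[4], a[5]), b4 = pavg(a[6], a[7]);
    unsigned c1 = pavg(b1, b2), c2 = pavg(b3, b4);
    unsigned d = pavg(c1, c2);
    unsigned eb1 = a[0] ^ a[1], eb2 = a[2] ^ a[3];
    unsigned eb3 = a[4] ^ a[5], eb4 = a[6] ^ a[7];
    unsigned ec1 = b1 ^ b2, ec2 = b3 ^ b4, ed = c1 ^ c2;
    unsigned u = ec1 | ec2, v = eb1 | eb2, w = eb3 | eb4;
    unsigned x = v | w, y = u | x;
    unsigned z = ec1 & ec2 & x;
    unsigned t = u & v & w & ((eb1 & eb2) | (eb3 & eb4));
    unsigned e = ((ed & y) | z | t) & 1u;
    return d - e;   /* equals the non-negative reference, so no clipping here */
}

/* Full-precision reference for Type 3, Filter 1. */
static unsigned filter1_reference(const uint8_t a[8]) {
    unsigned s = 4u;
    for (int i = 0; i < 8; ++i) s += a[i];
    return s >> 3;
}

/* Small linear congruential generator, used only to produce test bytes. */
static uint8_t next_byte(uint32_t *state) {
    *state = *state * 1664525u + 1013904223u;
    return (uint8_t)(*state >> 24);
}

/* Asserts that both computations agree on `trials` pseudo-random inputs. */
static void check_table3(unsigned long trials) {
    uint32_t state = 12345u;
    for (unsigned long n = 0; n < trials; ++n) {
        uint8_t a[8];
        for (int i = 0; i < 8; ++i) a[i] = next_byte(&state);
        assert(table3_exact(a) == filter1_reference(a));
    }
}
```

A call such as check_table3(200000) compares the two computations on pseudo-random byte values; a packed implementation would apply the same operations to all 8 lanes at once.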



iii. Approximate SIMD Solution

[63] We obtain many approximate solutions by assuming the least significant bit of EB1,
EB2, EB3, EB4, EC1, or EC2 to be 0 or 1. With the assumption that the least significant bit of
EB1 = 1, and EB2 = EB3 = EB4 = 0, we get from (43):

R = PAVG(C1,C2) - ((C1^C2) | (EC1 & EC2)) & ONE. (44)
[64] An example pseudo-code implementation of this solution is shown in Table IV,
below. This approach uses 14 instructions, and produces a maximum error of ±1 in the
final result for 9.38% of all possible values of A1,...,A8 between [0,255]. The error never
exceeds ±1. This solution, with 14 instructions and a maximum error of ±1 for less than
1/10th of the data, is acceptable in many applications such as post-processing, where a
difference of 1 gray value in the displayed frame is imperceptible to most viewers. Yet, we
receive a computational advantage of 35:14.

[65] The second approximate solution makes the assumption EB1 = 0, and EB2 = 1. It
produces the following solution:

T = (EC1 | EC2) & (EB3 | EB4) & EB3 & EB4,
R = PAVG(C1,C2) - ((C1^C2) | (EC1 & EC2) | T) & ONE, (45)
This solution requires 22 instructions, and produces a maximum error of ±1 in the final
result for 6.25% of all possible values of A1,...,A8 between [0,255].



#define P(a,b) (((a) + (b) + 1) >> 1)    // per-byte rounding average (PAVG)

b1 = P(a01,a02);
b2 = P(a03,a04);
b3 = P(a05,a06);
b4 = P(a07,a08);
c1 = P(b1 ,b2 );
c2 = P(b3 ,b4 );
d  = P(c1 ,c2 );
ec1 = b1 ^ b2;
ec2 = b3 ^ b4;
ed  = c1 ^ c2;
e = (ed | (ec1 & ec2)) & 0x01; // approx = 9.375%
x1 = CLIP(d-e);

TABLE IV
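The ±1 error bound of the Table IV approximation can be observed empirically. The C fragment below is a minimal sketch under stated assumptions: one byte lane is modeled with unsigned integers, CLIP is assumed to saturate at 0, and the names table4_approx, filter1_reference, next_byte and compare_table4 are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

static unsigned pavg(unsigned a, unsigned b) { return (a + b + 1u) >> 1; }

/* Full-precision reference: (A1+...+A8+4)>>3. */
static unsigned filter1_reference(const uint8_t a[8]) {
    unsigned s = 4u;
    for (int i = 0; i < 8; ++i) s += a[i];
    return s >> 3;
}

/* Table IV for one byte lane (approximate correction term). */
static unsigned table4_approx(const uint8_t a[8]) {
    unsigned b1 = pavg(a[0], a[1]), b2 = pavg(a[2], a[3]);
    unsigned b3 = pavg(a[4], a[5]), b4 = pavg(a[6], a[7]);
    unsigned c1 = pavg(b1, b2), c2 = pavg(b3, b4);
    unsigned d = pavg(c1, c2);
    unsigned ec1 = b1 ^ b2, ec2 = b3 ^ b4, ed = c1 ^ c2;
    unsigned e = (ed | (ec1 & ec2)) & 1u;
    return d >= e ? d - e : 0u;   /* assumed CLIP behaviour: saturate at 0 */
}

/* Test-only pseudo-random byte source (linear congruential generator). */
static uint8_t next_byte(uint32_t *state) {
    *state = *state * 1664525u + 1013904223u;
    return (uint8_t)(*state >> 24);
}

/* Returns how many of `trials` inputs differ from the reference and stores
   the largest absolute difference in *max_err. */
static unsigned long compare_table4(unsigned long trials, unsigned *max_err) {
    uint32_t state = 1u;
    unsigned long wrong = 0;
    assert(max_err != 0);
    *max_err = 0;
    for (unsigned long n = 0; n < trials; ++n) {
        uint8_t a[8];
        for (int i = 0; i < 8; ++i) a[i] = next_byte(&state);
        unsigned r = filter1_reference(a), x = table4_approx(a);
        unsigned err = r > x ? r - x : x - r;
        if (err) ++wrong;
        if (err > *max_err) *max_err = err;
    }
    return wrong;
}
```

Dividing the returned count by the number of trials gives an empirical error rate that can be compared with the 9.38% figure quoted above.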



B. Type 3, Filter 2: R = (A1 + A2 + A3 + A4 + A5 + A6 + A7 + A8 + 3*ONE) >> 3

[66] We require 35 instructions to compute this filter by conventional SIMD methods.
For the new SIMD solution, we write the filter as:

[67] R = (((A1+A2+A3+A4+2*ONE)>>2) + ((A5+A6+A7+A8+ONE)>>2) + E)>>1,
where E is the error/correction term that is necessary for dividing up the expression into
two parts each containing a Type 2 Filter. We have:

R = (C1 + C2 + (E - E1 - E2)&ONE) >> 1,
(A1+A2+A3+A4+2*ONE)>>2 = C1 - (E1 & ONE),
(A5+A6+A7+A8+ONE)>>2 = C2 - (E2 & ONE),
E1 = EC1 & (EB1 | EB2), E2 = EC2 | (EB3 & EB4),
E = (P1 & Q1) | (P0 & Q0 & (P1 | Q1)).

Here E1 and E2 are error/correction terms obtained from (21) and (27) respectively.
Defining ET = (E - E1 - E2) ∈ {-2, -1, 0}, we have:

R = PAVG(C1,C2) - (C1^C2)&ONE          when ET = 0,
R = PAVG(C1,C2) - ONE                  when ET = -1,     (46)
R = PAVG(C1,C2) - ONE - (C1^C2)&ONE    when ET = -2.

We simplify (46) as:

S = (E1^E2^E),

R = PAVG(C1,C2) - ((E1 & E2 & ~E) | S) & ONE - ((C1 ^ C2) & ~S) & ONE, (47)
This solution can be further simplified as: 

P = EB3 & EB4,

U = EB1 & EB2 & P,
V = EC1 & EC2,
W = EB3 | EB4,
X = (EC1 | EC2) & ((EB1 & (EB2 | W)) | (EB2 & W) | P),

Y = (X | V | U),

ED = (C1^C2),

R = PAVG(C1,C2) - ((ED | Y) & ONE) - (U & V & ED & ONE). (48)
The solution in (47) requires 35 instructions, same as the conventional 35 instructions. 
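The simplified form (48) can be checked per byte lane in the same way as Table III. The C fragment below is a minimal sketch; it reads (48) with P = EB3 & EB4 and W = EB3 | EB4, which is an assumption about the intended indices, and the names filter2_exact, filter2_reference, next_byte and check_filter2 are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

static unsigned pavg(unsigned a, unsigned b) { return (a + b + 1u) >> 1; }

/* One byte lane of (48): (A1+...+A8+3)>>3 from averages and XORs. */
static unsigned filter2_exact(const uint8_t a[8]) {
    unsigned b1 = pavg(a[0], a[1]), b2 = pavg(a[2], a[3]);
    unsigned b3 = pavg(a[4], a[5]), b4 = pavg(a[6], a[7]);
    unsigned c1 = pavg(b1, b2), c2 = pavg(b3, b4);
    unsigned d = pavg(c1, c2);
    unsigned eb1 = a[0] ^ a[1], eb2 = a[2] ^ a[3];
    unsigned eb3 = a[4] ^ a[5], eb4 = a[6] ^ a[7];
    unsigned ec1 = b1 ^ b2, ec2 = b3 ^ b4, ed = c1 ^ c2;
    unsigned p = eb3 & eb4;              /* assumed reading of P */
    unsigned u = eb1 & eb2 & p;          /* all four EB bits set */
    unsigned v = ec1 & ec2;
    unsigned w = eb3 | eb4;              /* assumed reading of W */
    unsigned x = (ec1 | ec2) & ((eb1 & (eb2 | w)) | (eb2 & w) | p);
    unsigned y = x | v | u;
    return d - ((ed | y) & 1u) - (u & v & ed & 1u);
}

/* Full-precision reference for Type 3, Filter 2. */
static unsigned filter2_reference(const uint8_t a[8]) {
    unsigned s = 3u;
    for (int i = 0; i < 8; ++i) s += a[i];
    return s >> 3;
}

/* Test-only pseudo-random byte source. */
static uint8_t next_byte(uint32_t *state) {
    *state = *state * 1664525u + 1013904223u;
    return (uint8_t)(*state >> 24);
}

/* Asserts agreement on `trials` pseudo-random inputs. */
static void check_filter2(unsigned long trials) {
    uint32_t state = 2024u;
    for (unsigned long n = 0; n < trials; ++n) {
        uint8_t a[8];
        for (int i = 0; i < 8; ++i) a[i] = next_byte(&state);
        assert(filter2_exact(a) == filter2_reference(a));
    }
}
```

The second subtracted term is nonzero only when every error bit is set, which is the single case in which the result lies 2 below PAVG(C1,C2).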
[68] An approximate solution of (47), obtained with the assumption that the
least significant bit of EB1 = EB2 = 1, and EB3 = EB4 = 0, is:

R = PAVG(C1,C2) - ((C1^C2) | EC1 | EC2) & ONE. (49)
[69] This solution requires 14 instructions, and produces a maximum error of ±1 in the
final result for 9.38% of all possible values of A1,...,A8 between [0,255]. We receive a
computational advantage of 35:14.

[70] The second approximate solution makes the assumption EB1 = 1, and EB2 = 0. It
produces the following solution:

Y = ((EC1 | EC2) & (EB3 | EB4)) | (EC1 & EC2),
R = PAVG(C1,C2) - ((C1^C2) | Y) & ONE. (50)
This solution requires 20 instructions, and produces a maximum error of ±1 in the final
result for 6.25% of all possible values of A1,...,A8 between [0,255].

C. Type 3, Filter 3: R = (A1 + A2 + A3 + A4 + A5 + A6 + A7 + A8 + 2*ONE) >> 3
[71] We require 35 instructions to compute this filter by conventional SIMD methods.
For the new SIMD solution, we write the filter as:

R = (((A1+A2+A3+A4+ONE)>>2) + ((A5+A6+A7+A8+ONE)>>2) + E)>>1,
where E is the error/correction term that is necessary for dividing up the expression into
two parts each containing a Type 2, Filter 2 (27). We have:
[72] R = (C1 + C2 + (E - E1 - E2)&ONE) >> 1,

(A1+A2+A3+A4+ONE)>>2 = C1 - (E1 & ONE),
(A5+A6+A7+A8+ONE)>>2 = C2 - (E2 & ONE),

E1 = EC1 | (EB1 & EB2), E2 = EC2 | (EB3 & EB4),
P1 = (EC1 ^ (EB1 & EB2)), P0 = ~(EB1 ^ EB2),
Q1 = (EC2 ^ (EB3 & EB4)), Q0 = ~(EB3 ^ EB4),
E = (P1 & Q1) | (P0 & Q0 & (P1 | Q1)).

Here E1 and E2 are error/correction terms obtained from (27). Defining ET = (E - E1 -
E2) ∈ {-2, -1, 0}, we have the same expression for R as in (46), which we simplify as:

S = (E1^E2),

R = PAVG(C1,C2) - (~(E & S) & (E1 | E2 | E)) & ONE - ((C1 ^ C2) & ~(S ^ E)) & ONE.

(51)

[73] This solution can be further simplified as:

P = EB3 | EB4,

Q = EB1 | EB2,
U = (EB1 & EB2 & P) | (EB3 & EB4 & Q),
V = EC1 & EC2,
W = P | Q,
ED = (C1^C2),

R = PAVG(C1,C2) - (ED | U | V | ((EC1 | EC2) & W)) & ONE - ED & U & V & ONE.

(52)

The solution in (52) requires 34 instructions, close to the conventional 35 instructions. 
[74] The approximate solution, with the assumption that the least significant bit of
EB1 = 1, and EB2 = EB3 = EB4 = 0, is:

R = PAVG(C1,C2) - ((C1^C2) | EC1 | EC2) & ONE. (53)
This solution requires 14 instructions, and produces a maximum error of ±1 in the final
result for 9.38% of all possible values of A1,...,A8 between [0,255]. We receive a
computational advantage of 35:14.

[75] The second approximate solution makes the assumption EB1 = 1, and EB2 = 0. It
produces the following solution:

U = EB3 & EB4,
ED = (C1^C2),

R = PAVG(C1,C2) - (ED | U | EC1 | EC2) & ONE - ED & U & EC1 & EC2 & ONE.

(54)

This solution requires 23 instructions, and produces a maximum error of ±1 in the final
result for 6.25% of all possible values of A1,...,A8 between [0,255].

D. Type 3, Filter 4: R = (A1 + A2 + A3 + A4 + A5 + A6 + A7 + A8 + ONE) >> 3

[76] We require 35 instructions to compute this filter by conventional SIMD methods.

For the new SIMD solution, we write the filter as:

R = (((A1+A2+A3+A4+ONE)>>2) + ((A5+A6+A7+A8)>>2) + E)>>1,

where E is the error/correction term that is necessary for dividing up the expression into
two parts each containing a Type 2 Filter. We have:

R = (C1 + C2 + (E - E1 - E2)&ONE) >> 1,
(A1+A2+A3+A4+ONE)>>2 = C1 - (E1 & ONE),
(A5+A6+A7+A8)>>2 = C2 - (E2 & ONE),
E1 = EC1 | (EB1 & EB2), E2 = EC2 | EB3 | EB4,
Q1 = (EC2 ^ (EB3 | EB4)), Q0 = (EB3 ^ EB4),

E = (P1 & Q1) | (P0 & Q0 & (P1 | Q1)).

Here E1 and E2 are error/correction terms obtained from (27) and (30) respectively.
Defining ET = (E - E1 - E2) ∈ {-2, -1, 0}, we have the same expression for R as in (46),
which we simplify as:

S = (E2^E),

R = PAVG(C1,C2) - (E1 | S) & ONE - ((C1 ^ C2) & ~(E1 ^ S)) & ONE, (55)
This solution can be further simplified as:

P = EB3 & EB4,
Q = EB3 | EB4,
U = (EB1 & (EB2 | Q)) | (EB2 & Q) | P,
V = EB1 & EB2 & P,
W = EC1 | EC2,
Z = (EC1 & EC2 & U),
ED = (C1^C2),

R = PAVG(C1,C2) - (ED | U | W) & ONE - ED & ((W & V) | Z) & ONE. (56)
The solution in (56) requires 35 instructions, same as the conventional 35 instructions. 
The approximate solution, with the assumption that the least significant bit of EB1 =
EB2 = 0, and EB3 = EB4 = 1, is:

R = PAVG(C1,C2) - ONE - ((C1^C2) & EC1 & EC2) & ONE. (57)

This solution requires 15 instructions, and produces a maximum error of ±1 in the final
result for 9.38% of all possible values of A1,...,A8 between [0,255]. We receive a
computational advantage of 35:15.

[77] The second approximate solution makes the assumption EB1 = 1, and EB2 = 0. It
produces the following solution:

Q = EB3 | EB4,
Z = (EC1 & EC2 & Q),
ED = (C1^C2),
R = PAVG(C1,C2) - ((ED | Q | EC1 | EC2) & ONE) - (ED & Z & ONE). (58)
This solution requires 23 instructions, and produces a maximum error of ±1 in the final
result for 6.25% of all possible values of A1,...,A8 between [0,255].



E. Type 3, Filter 5: R = (A1 + A2 + A3 + A4 + A5 + A6 + A7 + A8) >> 3

[78] We require 33 instructions to compute this filter by conventional SIMD methods.

For the new SIMD solution, we write the filter as:

R = (((A1+A2+A3+A4)>>2) + ((A5+A6+A7+A8)>>2) + E)>>1,
where E is the error/correction term that is necessary for dividing up the expression into
two parts each containing a Type 2, Filter (30). We have:

R = (C1 + C2 + (E - E1 - E2)&ONE) >> 1,
(A1+A2+A3+A4)>>2 = C1 - (E1 & ONE),
(A5+A6+A7+A8)>>2 = C2 - (E2 & ONE),
E1 = EC1 | EB1 | EB2, E2 = EC2 | EB3 | EB4,
P1 = (EC1 ^ (EB1 | EB2)), P0 = (EB1 ^ EB2),
Q1 = (EC2 ^ (EB3 | EB4)), Q0 = (EB3 ^ EB4),

E = (P1 & Q1) | (P0 & Q0 & (P1 | Q1)).
Here E1 and E2 are error/correction terms obtained from (27) and (30) respectively.
Defining ET = (E - E1 - E2) ∈ {-2, -1, 0}, we have the same expression for R as in (46),
which we simplify as:

R = PAVG(C1,C2) - (E1 | E2) & ONE - ((C1 ^ C2) & ~(E1 ^ E2 ^ E)) & ONE, (59)
This solution can be simplified as:

P = EB3 | EB4,
Q = EB1 | EB2,
U = P | Q,

V = (EB1 & EB2 & P) | (EB3 & EB4 & Q),
W = EC1 | EC2,
Z = (EC1 & EC2 & U),
ED = (C1^C2),
R = PAVG(C1,C2) - ((ED | U | W) & ONE) - (ED & ((W & V) | Z) & ONE). (60)
The solution in (59) requires 34 instructions, close to the conventional 33 instructions.
[79] The approximate solution, with the assumption that the least significant bit of
EB1 = 1, and EB2 = EB3 = EB4 = 0, is:

R = PAVG(C1,C2) - ONE - (C1 ^ C2) & EC1 & EC2 & ONE. (61)
This solution requires 15 instructions, and produces a maximum error of ±1 in the final
result for 9.38% of all possible values of A1,...,A8 between [0,255]. We receive a
computational advantage of 33:15.

[80] The second approximate solution makes the assumption EB1 = 1, and EB2 = 0. It
produces the following solution:

W = EC1 | EC2,

R = PAVG(C1,C2) - ONE - ((C1 ^ C2) & ((W & EB3 & EB4) | (EC1 & EC2)) & ONE).

(62)

This solution requires 21 instructions, and produces a maximum error of ±1 in the final
result for 6.25% of all possible values of A1,...,A8 between [0,255].

F. Type 3, Special Filter 1: R = (A1 + 2A3 + 2A5 + 2A7 + A2 + 4*ONE) >> 3

[81] This filter is an important loop filter for de-blocking in JVT video compression
standards.

i. Conventional SIMD Solution

[82] This filter can be implemented in SIMD architecture (assuming sufficient
memory) by using the following steps:

1. (5 Instructions) AiL = Unpack Low 4 Bytes of Ai, for i ∈ {1,2,3,5,7},

2. (5 Instructions) AiH = Unpack High 4 Bytes of Ai, for i ∈ {1,2,3,5,7},

3. (9 Instructions) Add and Shift lower 4 words of A1,...,A7 to obtain lower 4 words of
RL as:

RL = (A1L + 2A3L + 2A5L + 2A7L + A2L + 4*ONE4) >> 3,
4. (9 Instructions) Add and Shift higher 4 words of A1,...,A7 to obtain higher 4 words of
RH as:

RH = (A1H + 2A3H + 2A5H + 2A7H + A2H + 4*ONE4) >> 3,

5. (1 Instruction) Pack RH and RL into final register R.

We require 29 instructions to compute this filter by conventional SIMD methods.

ii. Efficient SIMD Solution
[83] From (34), we get:

B1 = PAVG(A1,A2), B2 = A3, B3 = A5, B4 = A7,
C1 = PAVG(B1,A3), C2 = PAVG(A5,A7),
EB1 = (A1^A2), EB2 = 0, EB3 = 0, EB4 = 0,
EC1 = (B1^A3), EC2 = (A5^A7). (63)

In (43) we get:

R = PAVG(C1,C2) - ((C1^C2) | (EC1 & EC2 & EB1)) & (EC1 | EC2 | EB1) & ONE.

(64)

The solution in (64) requires 16 instructions with a computational benefit of 29:16.
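The specialization of (43) to this 5-tap filter can be checked per byte lane. The C fragment below is a minimal sketch; it models one byte lane with unsigned integers, and the names sf1_exact, sf1_floor_avg, sf1_reference, next_byte and check_sf1 are illustrative only. The function sf1_floor_avg corresponds to dropping every error term except (C1^C2), which is the form the approximate solution of this filter takes.

```c
#include <assert.h>
#include <stdint.h>

static unsigned pavg(unsigned a, unsigned b) { return (a + b + 1u) >> 1; }

/* Full-precision reference: (A1 + 2A3 + 2A5 + 2A7 + A2 + 4) >> 3. */
static unsigned sf1_reference(unsigned a1, unsigned a2, unsigned a3,
                              unsigned a5, unsigned a7) {
    return (a1 + a2 + 2u * a3 + 2u * a5 + 2u * a7 + 4u) >> 3;
}

/* One byte lane of (63) and (64). */
static unsigned sf1_exact(unsigned a1, unsigned a2, unsigned a3,
                          unsigned a5, unsigned a7) {
    unsigned b1 = pavg(a1, a2);
    unsigned c1 = pavg(b1, a3), c2 = pavg(a5, a7);
    unsigned d = pavg(c1, c2);
    unsigned eb1 = a1 ^ a2, ec1 = b1 ^ a3, ec2 = a5 ^ a7, ed = c1 ^ c2;
    unsigned e = (ed | (ec1 & ec2 & eb1)) & (ec1 | ec2 | eb1) & 1u;
    return d - e;
}

/* Keeps only the (C1^C2) term, i.e. the truncating average of C1 and C2. */
static unsigned sf1_floor_avg(unsigned a1, unsigned a2, unsigned a3,
                              unsigned a5, unsigned a7) {
    unsigned c1 = pavg(pavg(a1, a2), a3), c2 = pavg(a5, a7);
    return pavg(c1, c2) - ((c1 ^ c2) & 1u);
}

/* Test-only pseudo-random byte source. */
static uint8_t next_byte(uint32_t *state) {
    *state = *state * 1664525u + 1013904223u;
    return (uint8_t)(*state >> 24);
}

/* Asserts exactness of sf1_exact and the +/-1 bound of sf1_floor_avg. */
static void check_sf1(unsigned long trials) {
    uint32_t state = 77u;
    for (unsigned long n = 0; n < trials; ++n) {
        unsigned a1 = next_byte(&state), a2 = next_byte(&state);
        unsigned a3 = next_byte(&state), a5 = next_byte(&state);
        unsigned a7 = next_byte(&state);
        unsigned r = sf1_reference(a1, a2, a3, a5, a7);
        unsigned x = sf1_floor_avg(a1, a2, a3, a5, a7);
        assert(sf1_exact(a1, a2, a3, a5, a7) == r);
        assert((r > x ? r - x : x - r) <= 1u);
    }
}
```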
iii. Approximate SIMD Solution

[84] The approximate solution with the assumptions that the least significant bit of EB1
= EC1 = 0, and EC2 = 1 is:

R = PAVG(C1,C2) - (C1^C2) & ONE. (65)
It requires 7 instructions, and produces a maximum error of ±1 in the final result for
12.5% of all possible values of A1,...,A7 between [0,255]. The computational advantage is
29:7 (approx. 4 times speedup).

G. Type 3, Special Filter 2: R = (A1 + A2 + A3 + 3A4 + 2A7 + 4*ONE) >> 3
[85] This filter is also an important loop filter for de-blocking in the JVT video
compression standard.
i. Conventional SIMD Solution

[86] This filter can be implemented in SIMD architecture (assuming sufficient
memory) by using the following steps:

1. (5 Instructions) AiL = Unpack Low 4 Bytes of Ai, for i ∈ {1,2,3,4,7},

2. (5 Instructions) AiH = Unpack High 4 Bytes of Ai, for i ∈ {1,2,3,4,7},

3. (8 Instructions) Add and Shift lower 4 words of A1,...,A7 to obtain lower 4 words of
RL as:

RL = (A1L + A2L + A3L + 3A4L + 2A7L + 4*ONE4) >> 3,

4. (8 Instructions) Add and Shift higher 4 words of A1,...,A7 to obtain higher 4 words of
RH as:

RH = (A1H + A2H + A3H + 3A4H + 2A7H + 4*ONE4) >> 3,

5. (1 Instruction) Pack RH and RL into final register R.

[87] We require 27 instructions (including two multiplications by 3) to compute this
filter by conventional SIMD methods.

ii. Efficient SIMD Solution
[88] From (34), we get:

B1 = PAVG(A1,A2), B2 = PAVG(A3,A4), B3 = A4, B4 = A7,
C1 = PAVG(B1,B2), C2 = PAVG(A4,A7),
EB1 = (A1^A2), EB2 = (A3^A4), EB3 = 0, EB4 = 0,

EC1 = (B1^B2), EC2 = (A4^A7), (66)

In (43) we get:

S = (EB1 | EB2),

R = PAVG(C1,C2) - ((C1^C2) | (EC1 & EC2 & S)) & (EC1 | EC2 | S) & ONE, (67)
The solution in (67) requires 19 instructions with a computational benefit of 27:19.

iii. Approximate SIMD Solution

[89] The first approximate solution with the assumptions that the least significant bit of
EC1 = EC2 = 1, and EB1 = EB2 = 0 is:
R = PAVG(C1,C2) - (C1^C2) & ONE. (68)
It requires 8 instructions, and produces a maximum error of ±1 in the final result for
12.5% of all possible values of A1,...,A7 between [0,255]. The computational advantage is
27:8.

The next approximate solution is with the assumption that the last bit of EB1 = EB2 =
1, which gives us:

R = PAVG(C1,C2) - ((C1^C2) | (EC1 & EC2)) & ONE. (69)
This solution requires 12 instructions and produces a maximum error of ±1 in the final
result for 6.25% of all possible values of A1,...,A7 between [0,255]. The computational
advantage is 27:12.

H. Type 3, Special Filter 3: R = (A1 + A2 + A3 + 2A4 + A5 + A6 + A7 + 4*ONE) >> 3
[90] This filter is used for de-blocking in post-processing.

i. Conventional SIMD Solution

[91] This filter can be implemented in SIMD architecture (assuming sufficient
memory) by using the following steps:

1. (7 Instructions) AiL = Unpack Low 4 Bytes of Ai, for i ∈ {1,2,3,4,5,6,7},

2. (7 Instructions) AiH = Unpack High 4 Bytes of Ai, for i ∈ {1,2,3,4,5,6,7},

3. (9 Instructions) Add and Shift lower 4 words of A1,...,A7 to obtain lower 4 words of
RL as:

RL = (A1L + A2L + A3L + 2A4L + A5L + A6L + A7L + 4*ONE4) >> 3,

4. (9 Instructions) Add and Shift higher 4 words of A1,...,A7 to obtain higher 4 words of
RH as:

RH = (A1H + A2H + A3H + 2A4H + A5H + A6H + A7H + 4*ONE4) >> 3,

5. (1 Instruction) Pack RH and RL into final register R.

We require 33 instructions to compute this filter by conventional SIMD methods.

ii. Efficient SIMD Solution
[92] From (34), we get:

B1 = PAVG(A1,A2), B2 = PAVG(A3,A5), B3 = PAVG(A6,A7), B4 = A4,
C1 = PAVG(B1,B2), C2 = PAVG(B3,A4),
EB1 = (A1^A2), EB2 = (A3^A5), EB3 = (A6^A7), EB4 = 0,

EC1 = (B1^B2), EC2 = (B3^A4). (70)

In (43) we get:

U = EC1 | EC2,
V = EB1 | EB2,
X = V | EB3,
Y = U | X,
Z = (EC1 & EC2 & X),

T = U & V & EB1 & EB2 & EB3,
R = PAVG(C1,C2) - ((C1^C2) | Z | T) & Y & ONE, (71)
The solution in (71) requires 27 instructions with a computational benefit of 33:27.
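The 7-tap specialization can be checked per byte lane in the same manner. The C fragment below is a minimal sketch; it assumes the pairing B2 = PAVG(A3,A5) and the reading T = U & V & EB1 & EB2 & EB3 of (71), and the names sf3_exact, sf3_reference, next_byte and check_sf3 are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

static unsigned pavg(unsigned a, unsigned b) { return (a + b + 1u) >> 1; }

/* Full-precision reference: (A1+A2+A3+2A4+A5+A6+A7+4) >> 3; a[0] is A1. */
static unsigned sf3_reference(const uint8_t a[7]) {
    unsigned s = 4u + a[3];              /* A4 is counted twice */
    for (int i = 0; i < 7; ++i) s += a[i];
    return s >> 3;
}

/* One byte lane of (70) and (71). */
static unsigned sf3_exact(const uint8_t a[7]) {
    unsigned b1 = pavg(a[0], a[1]);      /* PAVG(A1,A2) */
    unsigned b2 = pavg(a[2], a[4]);      /* PAVG(A3,A5), assumed pairing */
    unsigned b3 = pavg(a[5], a[6]);      /* PAVG(A6,A7) */
    unsigned c1 = pavg(b1, b2), c2 = pavg(b3, a[3]);
    unsigned d = pavg(c1, c2);
    unsigned eb1 = a[0] ^ a[1], eb2 = a[2] ^ a[4], eb3 = a[5] ^ a[6];
    unsigned ec1 = b1 ^ b2, ec2 = b3 ^ a[3], ed = c1 ^ c2;
    unsigned u = ec1 | ec2, v = eb1 | eb2;
    unsigned x = v | eb3, y = u | x;
    unsigned z = ec1 & ec2 & x;
    unsigned t = u & v & eb1 & eb2 & eb3;
    return d - (((ed | z | t) & y) & 1u);
}

/* Test-only pseudo-random byte source. */
static uint8_t next_byte(uint32_t *state) {
    *state = *state * 1664525u + 1013904223u;
    return (uint8_t)(*state >> 24);
}

/* Asserts agreement on `trials` pseudo-random inputs. */
static void check_sf3(unsigned long trials) {
    uint32_t state = 99u;
    for (unsigned long n = 0; n < trials; ++n) {
        uint8_t a[7];
        for (int i = 0; i < 7; ++i) a[i] = next_byte(&state);
        assert(sf3_exact(a) == sf3_reference(a));
    }
}
```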

iii. Approximate SIMD Solution

[93] We get an approximate solution with the assumptions that the least significant bit
of EB2 = 1, and EB1 = EB3 = 0 as:

R = PAVG(C1,C2) - ((C1^C2) | (EC1 & EC2)) & ONE. (72)
The solution in (72) requires 13 instructions, and produces a maximum error of ±1 in the
final result for 6.25% of all possible values of A1,...,A7 between [0,255]. The
computational advantage is 33:13.

I. Type 3, Summary of Results

[94] Table V below summarizes the instructions required to compute each filter (given
sufficient memory) by the efficient and conventional SIMD methods. For the Efficient
case, we give the instructions required for the exact and approximate solutions.
Summary of Results for Type 3 FIR Filters.

Type 3 Filters                         Conventional   Efficient Method    Speedup
                                       Method         Exact   Approx.     Exact   Approx.

(A1+A2+...+A8+4*ONE) >> 3              35             32      14          1.1     2.5
(A1+A2+...+A8+3*ONE) >> 3              35             35      14          1.0     2.5
(A1+A2+...+A8+2*ONE) >> 3              35             34      14          1.0     2.5
(A1+A2+...+A8+1*ONE) >> 3              35             35      15          1.0     2.3
(A1+A2+...+A8) >> 3                    33             34      15          1.0     2.2
(A1+2A3+2A5+2A7+A2+4*ONE) >> 3         29             16      7           1.8     4.1
(A1+A2+A3+3A4+2A7+4*ONE) >> 3          27             19      8           1.4     3.4
(A1+A2+A3+2A4+A5+A6+A7+4*ONE) >> 3     33             27      13          1.2     2.5

TABLE V

[95] The shaded areas show significant improvements in efficiency due to the analyses 
developed here. 



4. Type 4 FIR Filters 

[96] There are 9 different Type 4 FIR filters depending on the 9 choices of c in (9). For 
the sake of brevity, we shall only discuss the case of c=8. We define the following packed 
64-bit registers, each containing 8 data elements of one byte each: 

B1 = PAVG(A1,A2), B2 = PAVG(A3,A4), B3 = PAVG(A5,A6), B4 = PAVG(A7,A8),
B5 = PAVG(A9,A10), B6 = PAVG(A11,A12), B7 = PAVG(A13,A14), B8 = PAVG(A15,A16),
C1 = PAVG(B1,B2), C2 = PAVG(B3,B4), C3 = PAVG(B5,B6), C4 = PAVG(B7,B8),
D1 = PAVG(C1,C2), D2 = PAVG(C3,C4),

EB1 = (A1^A2), EB2 = (A3^A4), EB3 = (A5^A6), EB4 = (A7^A8),
EB5 = (A9^A10), EB6 = (A11^A12), EB7 = (A13^A14), EB8 = (A15^A16),
EC1 = (B1^B2), EC2 = (B3^B4), EC3 = (B5^B6), EC4 = (B7^B8),
ED1 = (C1^C2), ED2 = (C3^C4),

E1 = EC1 & (EB1 | EB2), E2 = EC2 & (EB3 | EB4),
E3 = EC3 & (EB5 | EB6), E4 = EC4 & (EB7 | EB8),

P1 = (EC1 ^ ~(EB1 | EB2)), P0 = (EB1 ^ EB2),
Q1 = (EC2 ^ ~(EB3 | EB4)), Q0 = (EB3 ^ EB4),

ER1 = (P1 & Q1) | ((P1 | Q1) & P0 & Q0),

R1 = (EC3 ^ ~(EB5 | EB6)), R0 = (EB5 ^ EB6),
S1 = (EC4 ^ ~(EB7 | EB8)), S0 = (EB7 ^ EB8),
ER2 = (R1 & S1) | ((R1 | S1) & R0 & S0),

U2 = ED1 ^ E1 ^ E2 ^ ((P1 & Q1) | ((P1 | Q1) & P0 & Q0)),
U1 = P1 ^ Q1 ^ (P0 & Q0),
U0 = P0 ^ Q0,

V2 = ED2 ^ E3 ^ E4 ^ ((R1 & S1) | ((R1 | S1) & R0 & S0)),
V1 = R1 ^ S1 ^ (R0 & S0),
V0 = R0 ^ S0,

E = (U2 & V2) | ((U2 | V2) & U1 & V1) | ((U2 | V2) & (U1 | V1) & U0 & V0),

ET1 = (ED1 | (E1 ^ E2 ^ ER1)) & (E1 | E2 | ~ER1),
ET2 = (ED2 | (E3 ^ E4 ^ ER2)) & (E3 | E4 | ~ER2). (73)


A. Type 4, Filter 1: R = (A1 + A2 + ... + A15 + A16 + 8*ONE) >> 4
i. Conventional SIMD Solution

[97] This filter can be implemented in SIMD architecture (assuming sufficient
memory) by using the following steps:

1. (16 Instructions) AiL = Unpack Low 4 Bytes of Ai, for i = 1,...,16,

2. (16 Instructions) AiH = Unpack High 4 Bytes of Ai, for i = 1,...,16,
3. (17 Instructions) Add and Shift lower 4 words of A1,...,A16 to obtain lower 4 words of
RL as:

RL = (A1L + A2L + ... + A15L + A16L + 8*ONE4) >> 4,

4. (17 Instructions) Add and Shift higher 4 words of A1,...,A16 to obtain higher 4 words
of RH as:

RH = (A1H + A2H + ... + A15H + A16H + 8*ONE4) >> 4,

5. (1 Instruction) Pack RH and RL into final register R.

We require 67 instructions to compute this filter by conventional SIMD methods.

ii. Efficient SIMD Solution

[98] In order to implement this filter efficiently, we simplify it as follows:

R = ((A1+A2+...+A7+A8+4*ONE)>>3 + (A9+A10+...+A15+A16+4*ONE)>>3 + E)>>1,

(74)

where E is the error/correction term that is necessary for dividing up the expression into
two parts each containing a Type 3, Filter 1 (42). Note that according to (42) and the
registers in (73), we have:

(A1+A2+...+A7+A8+4*ONE)>>3 = D1 - (ET1 & ONE),
(A9+A10+...+A15+A16+4*ONE)>>3 = D2 - (ET2 & ONE), (75)
where ET1 and ET2 are error/correction terms obtained in (42). We can simplify (74) as:

R = (D1 + D2 + (E - ET1 - ET2)&ONE) >> 1. (76)
Note that the expressions for E, ET1, and ET2 are given in (73). We note that E, ET1, ET2
∈ {0, 1}, and ET = (E - ET1 - ET2) ∈ {-2, -1, 0, 1}. From (76), we have:

R = PAVG(D1,D2) - ONE - (D1^D2)&ONE    when ET = -2,
R = PAVG(D1,D2) - ONE                  when ET = -1,
R = PAVG(D1,D2) - (D1^D2)&ONE          when ET = 0,      (77)
R = PAVG(D1,D2)                        when ET = 1.

We simplify (77) as:

R = PAVG(D1,D2) - ((ET1 & ET2) | ~E) & (ET1 | ET2 | E) & ONE

- (D1^D2) & ~(ET1^ET2^E) & ONE. (78)
We can further simplify (78) as:
U1 = At least 1 ED,
U2 = At least 1 EC,
U3 = At least 1 EB,

U4 = Both EDs,
U5 = At least 2 ECs,
U6 = At least 3 ECs,
U7 = All 4 ECs,
U8 = At least 3 EBs,
U9 = At least 5 EBs,
U10 = At least 7 EBs,
U11 = At least 1 ED, EC or EB,
E1 = ((D1^D2) & U11) | (U4 & (U2 | U3)) | (U1 & U6) | (U1 & U5 & U3) | (U1 & U2 & U8) |
(U1 & U9) | (U7 & U3) | (U6 & U8) | (U5 & U9) | (U2 & U10),
E2 = ((D1^D2) & U4 & ((U7 & U3) | (U6 & U8) | (U5 & U9) | (U2 & U10))) |

((D1^D2) & U1 & ((U7 & U9) | (U6 & U10))),

[99] Clearly, (79) is an inefficient solution and is useful in special cases and for 
approximate solutions only. 

iii. Approximate SIMD Solution

[100] We have many approximate solutions by assuming the least significant bit of EB1,
..., EB8, EC1, ..., EC4, ED1, or ED2 as 0 or 1. With the assumption that the last bit of E = ET1 =
ET2 = 1, we get from (78):

R = PAVG(D1,D2) - ONE. (80)
This solution requires 16 instructions, and produces a maximum error of ±1 in the final
result for 8.6% of all possible values of A1,...,A16 between [0,255]. The error never
exceeds ±1. We receive a computational advantage of 67:16, an approximately 4 times
speedup.
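The ±1 bound of the approximation (80) can be observed with a per-lane model. The C fragment below is a minimal sketch under stated assumptions: the subtraction of ONE is assumed to saturate at 0, one byte lane is modeled with unsigned integers, and the names avg16_approx, avg16_reference, next_byte and max_error_avg16 are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

static unsigned pavg(unsigned a, unsigned b) { return (a + b + 1u) >> 1; }

/* Full-precision reference: (A1+...+A16+8) >> 4. */
static unsigned avg16_reference(const uint8_t a[16]) {
    unsigned s = 8u;
    for (int i = 0; i < 16; ++i) s += a[i];
    return s >> 4;
}

/* One byte lane of (80): a tree of 15 rounding averages, then subtract ONE. */
static unsigned avg16_approx(const uint8_t a[16]) {
    unsigned b[8], c[4], d[2];
    for (int i = 0; i < 8; ++i) b[i] = pavg(a[2 * i], a[2 * i + 1]);
    for (int i = 0; i < 4; ++i) c[i] = pavg(b[2 * i], b[2 * i + 1]);
    for (int i = 0; i < 2; ++i) d[i] = pavg(c[2 * i], c[2 * i + 1]);
    unsigned r = pavg(d[0], d[1]);
    return r ? r - 1u : 0u;              /* assumed saturating subtract */
}

/* Test-only pseudo-random byte source. */
static uint8_t next_byte(uint32_t *state) {
    *state = *state * 1664525u + 1013904223u;
    return (uint8_t)(*state >> 24);
}

/* Returns the largest absolute error seen over `trials` random inputs. */
static unsigned max_error_avg16(unsigned long trials) {
    uint32_t state = 5u;
    unsigned worst = 0;
    assert(trials > 0);
    for (unsigned long n = 0; n < trials; ++n) {
        uint8_t a[16];
        for (int i = 0; i < 16; ++i) a[i] = next_byte(&state);
        unsigned r = avg16_reference(a), x = avg16_approx(a);
        unsigned err = r > x ? r - x : x - r;
        if (err > worst) worst = err;
    }
    return worst;
}
```

The tree of 15 averages plus one subtraction matches the count of 16 instructions given above.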
B. Type 4, Special Filter 1: R = (A1 + 4A2 + 6A3 + 4A4 + A5 + 8*ONE) >> 4
[101] This filter is a Gaussian approximation filter used for post-processing.

i. Conventional SIMD Solution

[102] This filter can be implemented in SIMD architecture (assuming sufficient
memory) by using the following steps:

1. (5 Instructions) AiL = Unpack Low 4 Bytes of Ai, for i ∈ {1,2,3,4,5},

2. (5 Instructions) AiH = Unpack High 4 Bytes of Ai, for i ∈ {1,2,3,4,5},

3. (9 Instructions) Add and Shift lower 4 words of A1,...,A5 to obtain lower 4 words of
RL as:

RL = (A1L + 4A2L + 6A3L + 4A4L + A5L + 8*ONE4) >> 4,

4. (9 Instructions) Add and Shift higher 4 words of A1,...,A5 to obtain higher 4 words
of RH as:

RH = (A1H + 4A2H + 6A3H + 4A4H + A5H + 8*ONE4) >> 4,

5. (1 Instruction) Pack RH and RL into final register R.

We require 29 instructions to compute this filter by conventional SIMD methods.

ii. Efficient SIMD Solution 

From (73), we get the: 

B^=PAVGiA^J2lB2 = A^, B^=A^,B^ = A^,Bs=A^,B^^A^, B^=A^,B^=A^, 

Ci ^PAVG{B^Ai\ C2 =^3. C3 =^4, C4 = ^5, 
D, = PA FG(Cj ,^3), i)2 = PA VG{A^A^\ 
EBi =^i^^2^ EB2=EB^ =EB4=EBs =EB^=-EBj^EB^ = 0, 
^Ci - (^1^3), = i:C3 = EC4 = 0, 

E\ — EC\ & EB^j E2 ~ E^ = £4 = 0, 
Pi = £Ci ^ Po = Qi = hQo = 0, ERi = Pi, 
/?, = !, Ro = 0,Si = l,So = 0,ER2 = h 
ET1 = (ED1 | (E1 ^ ER1)), ET2 = 0. (81)

From (78) we get:

R = PAVG(D1,D2) - (~E & ET1 & ONE) - (D1^D2) & ~(ET1^E) & ONE, (82)
We can simplify (82) as follows:

U = EC1 | EB1,

R = PAVG(D1,D2) - (((D1^D2) & (ED1 | ED2 | U)) | (ED1 & ED2 & U)) & ONE,

(83)

The solution in (83) requires 19 instructions with a 29:19 computational advantage.
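The form (83) can be checked per byte lane against the 1-4-6-4-1 filter. The C fragment below is a minimal sketch under stated assumptions: the two weight-1 taps A1 and A5 are paired in the first average, D1 = PAVG(PAVG(B1,A3),A3) and D2 = PAVG(A2,A4), so that EB1 is the XOR of the two weight-1 taps. The names gauss5_exact, gauss5_reference, next_byte and check_gauss5 are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

static unsigned pavg(unsigned a, unsigned b) { return (a + b + 1u) >> 1; }

/* Full-precision reference: (A1 + 4A2 + 6A3 + 4A4 + A5 + 8) >> 4. */
static unsigned gauss5_reference(unsigned a1, unsigned a2, unsigned a3,
                                 unsigned a4, unsigned a5) {
    return (a1 + 4u * a2 + 6u * a3 + 4u * a4 + a5 + 8u) >> 4;
}

/* One byte lane of (83), with the tap pairing assumed in the lead-in. */
static unsigned gauss5_exact(unsigned a1, unsigned a2, unsigned a3,
                             unsigned a4, unsigned a5) {
    unsigned b1 = pavg(a1, a5);          /* the two weight-1 taps */
    unsigned c1 = pavg(b1, a3);
    unsigned d1 = pavg(c1, a3);          /* A3 accumulates weight 6 of 16 */
    unsigned d2 = pavg(a2, a4);          /* A2 and A4 carry weight 4 each */
    unsigned r = pavg(d1, d2);
    unsigned eb1 = a1 ^ a5, ec1 = b1 ^ a3;
    unsigned ed1 = c1 ^ a3, ed2 = a2 ^ a4;
    unsigned u = ec1 | eb1;
    unsigned e = (((d1 ^ d2) & (ed1 | ed2 | u)) | (ed1 & ed2 & u)) & 1u;
    return r - e;
}

/* Test-only pseudo-random byte source. */
static uint8_t next_byte(uint32_t *state) {
    *state = *state * 1664525u + 1013904223u;
    return (uint8_t)(*state >> 24);
}

/* Asserts agreement on `trials` pseudo-random inputs. */
static void check_gauss5(unsigned long trials) {
    uint32_t state = 314159u;
    for (unsigned long n = 0; n < trials; ++n) {
        unsigned a1 = next_byte(&state), a2 = next_byte(&state);
        unsigned a3 = next_byte(&state), a4 = next_byte(&state);
        unsigned a5 = next_byte(&state);
        assert(gauss5_exact(a1, a2, a3, a4, a5) ==
               gauss5_reference(a1, a2, a3, a4, a5));
    }
}
```

Because the largest possible accumulated rounding excess is below 25 sixteenths for this filter, a single correction bit suffices and no second subtraction is needed.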
iii. Approximate SIMD Solution

[103] We can assume the least significant bit of ED1, ED2, EC1, or EB1 as 0 or 1 to get
several approximate solutions. We first make the assumption that the least significant bit
of ED1 = 1, and ED2 = EC1 = EB1 = 0, to get the following solution:

R = PAVG(D1,D2) - (D1^D2) & ONE, (84)
This solution requires 8 instructions, and produces a maximum error of ±1 for 12.5% of
all possible values of A1,...,A5 between [0,255]. The computational advantage is 29:8
(more than 3 times speedup).

The second approximate solution makes the assumption that the least significant bit of
EC1 = 1, and EB1 = 0, to get the solution:

R = PAVG(D1,D2) - ((D1^D2) | (ED1 & ED2)) & ONE. (85)
This solution requires 12 instructions, and produces a maximum error of ±1 for 6.25% of
all possible values of A1,...,A5 between [0,255]. The computational advantage is 29:12.

C. Type 4, Special Filter 2:

R = (A1 + A2 + 2A5 + 2A6 + 2A7 + 2A8 + 4A9 + A3 + A4 + 8*ONE) >> 4

[104] This filter is also a Gaussian approximation filter used for post-processing.
i. Conventional SIMD Solution

[105] This filter can be implemented in SIMD architecture (assuming sufficient
memory) by using the following steps:

1. (9 Instructions) AiL = Unpack Low 4 Bytes of Ai, for i ∈ {1,2,...,9},

2. (9 Instructions) AiH = Unpack High 4 Bytes of Ai, for i ∈ {1,2,...,9},

3. (15 Instructions) Add and Shift lower 4 words of A1,...,A9 to obtain lower 4 words of
RL as:

RL = (A1L + A2L + 2A5L + 2A6L + 2A7L + 2A8L + 4A9L + A3L + A4L + 8*ONE4) >> 4,

4. (15 Instructions) Add and Shift higher 4 words of A1,...,A9 to obtain higher 4 words
of RH as:

RH = (A1H + A2H + 2A5H + 2A6H + 2A7H + 2A8H + 4A9H + A3H + A4H + 8*ONE4) >> 4,

5. (1 Instruction) Pack RH and RL into final register R.

We require 49 instructions to compute this filter by conventional SIMD methods.

ii. Efficient SIMD Solution
[106] From (73), we get:

B1 = PAVG(A1,A2), B2 = PAVG(A3,A4), B3 = A5, B4 = A6, B5 = A7, B6 = A8, B7 = A9, B8 =
A9,

C1 = PAVG(B1,B2), C2 = PAVG(A5,A6), C3 = PAVG(A7,A8), C4 = A9,
D1 = PAVG(C1,C2), D2 = PAVG(C3,A9),
EB1 = (A1^A2), EB2 = (A3^A4), EB3 = EB4 = EB5 = EB6 = EB7 = EB8 = 0,
EC1 = (B1^B2), EC2 = (A5^A6), EC3 = (A7^A8), EC4 = 0,
ED1 = (C1^C2), ED2 = (C3^A9),

E1 = EC1 & (EB1 | EB2), E2 = E3 = E4 = 0,
P1 = EC1 ^ ~(EB1 | EB2), P0 = (EB1^EB2), Q1 = ~EC2, Q0 = 0, ER1 = P1 & Q1,
R1 = ~EC3, R0 = 0, S1 = 1, S0 = 0, ER2 = R1,
U2 = ED1 ^ E1 ^ (P1 & Q1), U1 = P1 ^ Q1, U0 = P0,
E = (U2 & V2) | ((U2 | V2) & U1 & V1),

ET1 = (ED1 | (E1 ^ ER1)) & (E1 | ~ER1), ET2 = (ED2 & ~ER2). (86)
[107] The final solution is the same as (78). We can simplify this solution as follows:

U = EB1 | EB2,
V = EC1 | EC3,
W = EC2 | U | V,
Z = ED1 | ED2,

F = Z | W,
H = EC1 & EC3,
G = (EC2 & V) | H,

R = PAVG(D1,D2) - (((D1^D2) & F) | (ED1 & ED2 & W) | (Z & ((EC2 & H) | (G & U)))) & ONE.

(87)

[108] The solution in (87) requires 36 instructions with a 49:36 computational
advantage.

iii. Approximate SIMD Solution

[109] We suggest 2 approximate solutions for this filter. For the first approximate
solution, we assume the least significant bit of EC3 = 0, and EB1 = EB2 = 1 to get the
following:

R = PAVG(D1,D2) - ((D1^D2) | (ED1 & ED2) | ((ED1 | ED2) & EC1 & EC2)) & ONE. (88)
This solution requires 21 instructions, and produces a maximum error of ±1 for 6.25% of
all possible values of A1,...,A9 between [0,255]. The computational advantage is 49:21
(more than 2 times speedup).

[110] The second approximate solution makes the assumption that the least significant
bit of EB1 = EB2 = 1. We get the solution:

U = (EC1 & (EC2 | EC3)) | (EC2 & EC3),
R = PAVG(D1,D2) - ((D1^D2) | (ED1 & ED2) | ((ED1 | ED2) & U)) & ONE. (89)
This solution requires 25 instructions, and produces a maximum error of ±1 for 3.12% of
all possible values of A1,...,A9 between [0,255]. The computational advantage is 49:25
(nearly 2 times speedup).

D. Type 4, Special Filter 3:

R = (A1 + 2A2 + 2A3 + 2A4 + 2A5 + 2A6 + 2A7 + 2A8 + A9 + 8*ONE) >> 4
[111] This filter is also used for post-processing.

i. Conventional SIMD Solution

[112] This filter can be implemented in SIMD architecture (assuming sufficient
memory) by using the following steps:

1. (9 Instructions) AiL = Unpack Low 4 Bytes of Ai, for i ∈ {1,2,...,9},

2. (9 Instructions) AiH = Unpack High 4 Bytes of Ai, for i ∈ {1,2,...,9},

3. (17 Instructions) Add and Shift lower 4 words of A1,...,A9 to obtain lower 4 words of
RL as:

RL = (A1L + 2A2L + 2A3L + 2A4L + 2A5L + 2A6L + 2A7L + 2A8L + A9L + 8*ONE4) >> 4,

4. (17 Instructions) Add and Shift higher 4 words of A1,...,A9 to obtain higher 4 words
of RH as:

RH = (A1H + 2A2H + 2A3H + 2A4H + 2A5H + 2A6H + 2A7H + 2A8H + A9H + 8*ONE4) >> 4,

5. (1 Instruction) Pack RH and RL into final register R.

We require 53 instructions to compute this filter by conventional SIMD methods.

ii. Efficient SIMD Solution

[113] From (73), we get:

B1 = PAVG(A1,A9), B2 = A2, B3 = A3, B4 = A4, B5 = A5, B6 = A6, B7 = A7, B8 = A8,

C1 = PAVG(B1,A2), C2 = PAVG(A3,A4), C3 = PAVG(A5,A6), C4 = PAVG(A7,A8),
D1 = PAVG(C1,C2), D2 = PAVG(C3,C4),
EB1 = (A1^A9), EB2 = EB3 = EB4 = EB5 = EB6 = EB7 = EB8 = 0,
P1 = EC1 ^ ~EB1, P0 = EB1, Q1 = ~EC2, Q0 = 0, ER1 = P1 & Q1,

R1 = ~EC3, R0 = 0, S1 = ~EC4, S0 = 0, ER2 = R1 & S1,

V2 = ED2 ^ (R1 & S1), V1 = R1 ^ S1, V0 = 0,

E = (U2 & V2) | ((U2 | V2) & U1 & V1),
ET1 = (ED1 | (E1 ^ ER1)) & (E1 | ~ER1), ET2 = (ED2 & ~ER2). (90)
The final solution is the same as (78). We can simplify this solution as follows:

U1 = EC1 & EC2,
U2 = EC3 & EC4,
U3 = ED1 & ED2,
V1 = EC1 | EC2,
V2 = EC3 | EC4,
V3 = ED1 | ED2,
V4 = V2 | EB1,
W = (V1 & V2 & EB1) | (U1 & V4) | (U2 & (V1 | EB1)),
F = V1 | V4,

G = U1 & U2 & EB1,
H = G & U3 & (D1^D2),

R = PAVG(D1,D2) - ((((D1^D2) & (V3 | F)) | (U3 & F) | (V3 & W) | G) & ONE) - (H & ONE).

(91)

[114] The solution in (91) requires 46 instructions with a 53:46 computational
advantage.
iii. Approximate SIMD Solution

[115] We suggest 2 approximate solutions for this filter. For the first approximate
solution, we assume the least significant bit of EC1 = EC2 = 1, and EC3 = EC4 = 0 to get
the following:

R = PAVG(D1,D2) - ((D1^D2) | (ED1 & ED2) | ((ED1 | ED2) & EB1)) & ONE. (92)
This solution requires 19 instructions, and produces a maximum error of ±1 for 9.38% of
all possible values of A1,...,A9 between [0,255]. The computational advantage is 53:19
(more than 2 times speedup).

The second approximate solution makes the assumption that the least significant bit of
EC1 = 1, and EC2 = 0. We get the solution:

W = (EC4 & EB1) | (EC3 & (EC4 | EB1)),

R = PAVG(D1,D2) - ((D1^D2) | (ED1 & ED2) | ((ED1 | ED2) & W)) & ONE. (93)
[116] This solution requires 25 instructions, and produces a maximum error of ±1 for
6.25% of all possible values of A1,...,A9 between [0,255]. The computational advantage is
53:25 (more than 2 times speedup).

E, Type 4, Special Filter 4: R = (A | +2A^+3A^+4A^+3AcH-2A^+A^+8*ONE) » 4 
[117] This filter is also used for post-processing. 

i. Conventional SIMD Solution 

[118] This filter can be implemented in SIMD architecture (assuming sufficient 
memory) by using the following steps: 

1. (7 Instructions) A_iL = Unpack Low 4 Bytes of A_i, for i ∈ {1,2,...,7}, 

2. (7 Instructions) A_iH = Unpack High 4 Bytes of A_i, for i ∈ {1,2,...,7}, 

3. (13 Instructions) Add and Shift lower 4 words of A_1, ..., A_7 to obtain lower 4 words of 
R_L as: 

R_L = (A_1L + 2A_2L + 3A_3L + 4A_4L + 3A_5L + 2A_6L + A_7L + 8*ONE) >> 4, 

4. (13 Instructions) Add and Shift higher 4 words of A_1, ..., A_7 to obtain higher 4 words 
of R_H as: 

R_H = (A_1H + 2A_2H + 3A_3H + 4A_4H + 3A_5H + 2A_6H + A_7H + 8*ONE) >> 4, 

5. (1 Instruction) Pack R_L and R_H into final register R. 

We require 41 instructions to compute this filter by conventional SIMD methods. 
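
The five steps above can be sketched in Python by modelling an 8-byte register as a list of eight integers. This is an illustrative emulation only: the function names are hypothetical, the "low" bytes are assumed to be the first four list elements, the multiplications stand in for the add and shift instructions that a real SIMD unit would use, and the final pack is assumed to saturate to the byte range.

```python
WEIGHTS = (1, 2, 3, 4, 3, 2, 1)  # coefficients of A_1 ... A_7


def unpack_low(reg):
    """Steps 1: zero-extend the low 4 bytes of a register to 4 words."""
    return list(reg[:4])


def unpack_high(reg):
    """Step 2: zero-extend the high 4 bytes of a register to 4 words."""
    return list(reg[4:])


def add_and_shift(words):
    """Steps 3 and 4: weighted sum of seven 4-word registers, plus 8, >> 4."""
    out = []
    for lane in range(4):
        acc = 8
        for weight, reg in zip(WEIGHTS, words):
            acc += weight * reg[lane]
        # The largest sum is 16 * 255 + 8 = 4088, so 16-bit words suffice.
        out.append((acc & 0xFFFF) >> 4)
    return out


def pack(low, high):
    """Step 5: pack two 4-word results into one 8-byte register."""
    return [min(word, 255) for word in low + high]


def filter4_conventional(regs):
    """Apply Special Filter 4 to seven 8-byte registers A_1 ... A_7."""
    lows = [unpack_low(r) for r in regs]
    highs = [unpack_high(r) for r in regs]
    return pack(add_and_shift(lows), add_and_shift(highs))
```

The widening to 16-bit words is what forces the unpack and pack steps, and it is the main cost that the averaging-based solution avoids.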



ii. Efficient SIMD Solution 
[119] From (73), we get: 

B_1 = PAVG(A_1, A_7), B_2 = A_2, B_3 = A_4, B_4 = A_4, B_5 = A_3, B_6 = A_5, B_7 = A_6, 
B_8 = PAVG(A_3, A_5), 

C_1 = PAVG(B_1, A_2), C_2 = A_4, C_3 = PAVG(A_3, A_5), C_4 = PAVG(A_6, B_8), 

D_1 = PAVG(C_1, A_4), D_2 = PAVG(C_3, C_4), 
EB_1 = A_1 ^ A_7, EB_2 = EB_3 = EB_4 = EB_5 = EB_6 = EB_7 = 0, EB_8 = A_3 ^ A_5, 

EC_1 = B_1 ^ A_2, EC_2 = 0, EC_3 = A_3 ^ A_5, EC_4 = A_6 ^ B_8, 
ED_1 = C_1 ^ A_4, ED_2 = C_3 ^ C_4, 

E_1 = EC_1 & EB_1, E_2 = E_3 = 0, E_4 = EC_4 & EB_8, 
P_1 = EC_1 ^ ~EB_1, P_0 = EB_1, Q_1 = 1, Q_0 = 0, ER_1 = P_1, 
R_1 = ~EC_3, R_0 = 0, S_1 = EC_4 ^ ~EB_8, S_0 = EB_8, ER_2 = R_1 & S_1, 

U_2 = ED_1 ^ ER_1, U_1 = ~P_1, U_0 = P_0, [V_2, V_1, V_0 illegible] 

W = (U_2 & V_2) | ((U_2 | V_2) & U_1 & V_1) | ((U_2 | V_2) & (U_1 | V_1) & U_0 & V_0), 

ET_1 = (ED_1 | (E_1 ^ ER_1)) & (E_1 | ~ER_1), ET_2 = (ED_2 | (E_4 ^ ER_2)) & (E_4 | ~ER_2). (94) 

The final solution is the same as (78). We can simplify this solution as follows: 

U_1 = EC_3 | EC_4, 
U_2 = EB_1 | EB_8, 
U_3 = ED_1 | ED_2, 
U_4 = EC_1 | U_1, 
U_5 = U_4 | U_2, 
U_6 = U_5 | U_3, 
U_7 = EC_3 & EC_4, 
U_8 = (EC_1 & U_1) | U_7, 
U_9 = (EC_1 & U_7) | (U_8 & U_2), 
R = PAVG(D_1, D_2) - (((D_1 ^ D_2) & U_6) | (ED_1 & ED_2 & U_5) | (U_3 & U_9)) & ONE. (95) 
The solution in (95) requires 36 instructions with a 41:36 computational advantage. 
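
The simplified solution (95) can be checked against the direct formula with a small Python emulation of one byte lane. The sketch makes several assumptions that are not spelled out in this passage: PAVG is taken to be (a + b + 1) >> 1, the tree of averages is B_1 = PAVG(A_1, A_7), B_8 = C_3 = PAVG(A_3, A_5), C_1 = PAVG(B_1, A_2), C_4 = PAVG(A_6, B_8), and all function names are hypothetical.

```python
import random


def pavg(a, b):
    """Rounded-up average of two bytes (assumed PAVG semantics)."""
    return (a + b + 1) >> 1


def filter4_direct(a):
    """Reference: (A1 + 2A2 + 3A3 + 4A4 + 3A5 + 2A6 + A7 + 8) >> 4."""
    a1, a2, a3, a4, a5, a6, a7 = a
    return (a1 + 2 * a2 + 3 * a3 + 4 * a4 + 3 * a5 + 2 * a6 + a7 + 8) >> 4


def filter4_eq95(a):
    """Tree of averages followed by the bitwise correction of (95)."""
    a1, a2, a3, a4, a5, a6, a7 = a
    b1, b8 = pavg(a1, a7), pavg(a3, a5)
    c1, c3, c4 = pavg(b1, a2), b8, pavg(a6, b8)  # C_2 is simply A_4
    d1, d2 = pavg(c1, a4), pavg(c3, c4)
    eb1, eb8 = a1 ^ a7, a3 ^ a5
    ec1, ec3, ec4 = b1 ^ a2, a3 ^ a5, a6 ^ b8
    ed1, ed2 = c1 ^ a4, c3 ^ c4
    u1 = ec3 | ec4
    u2 = eb1 | eb8
    u3 = ed1 | ed2
    u4 = ec1 | u1
    u5 = u4 | u2
    u6 = u5 | u3
    u7 = ec3 & ec4
    u8 = (ec1 & u1) | u7
    u9 = (ec1 & u7) | (u8 & u2)
    corr = (((d1 ^ d2) & u6) | (ed1 & ed2 & u5) | (u3 & u9)) & 1
    return pavg(d1, d2) - corr


def count_mismatches(trials=20000, seed=0):
    """Count disagreements on random byte inputs plus two corner cases."""
    rng = random.Random(seed)
    samples = [[0] * 7, [255] * 7]
    samples += [[rng.randrange(256) for _ in range(7)] for _ in range(trials)]
    return sum(filter4_direct(s) != filter4_eq95(s) for s in samples)
```

The correction is a single bit here because, under these assumptions, the accumulated rounding excess of this tree never exceeds 24 sixteenths, so the average tree overshoots the exact result by at most one.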

iii. Approximate SIMD Solution 
[120] We suggest 2 approximate solutions for this filter. For the first approximate 
solution, we assume the least significant bit of EC_1 = EC_3 = 1, and EC_4 = EB_1 = 0 to get 
the following: 

R = PAVG(D_1, D_2) - ((D_1 ^ D_2) | (ED_1 & ED_2) | ((ED_1 | ED_2) & EB_8)) & ONE. (96) 
This solution requires 19 instructions, and produces a maximum error of ±1 for 6.25% of 
all possible values of A_1, ..., A_7 in [0,255]. The computational advantage is 41:19 
(more than 2 times speedup). 

[121] The second approximate solution makes the assumption that the least significant 
bit of EC_3 = 1, and EB_1 = 0. We get the solution: 

U_9 = (EC_1 & EC_4) | ((EC_1 | EC_4) & EB_8), 
R = PAVG(D_1, D_2) - ((D_1 ^ D_2) | (ED_1 & ED_2) | ((ED_1 | ED_2) & U_9)) & ONE. (97) 
This solution requires 25 instructions, and produces a maximum error of ±1 for 3.13% of 
all possible values of A_1, ..., A_7 in [0,255]. The computational advantage is 41:25. 
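
A quick way to see the trade-off between (96) and (97) is to measure both on sampled inputs. The Python sketch below emulates one byte lane under the same assumptions as the exact tree of averages for this filter (PAVG taken as (a + b + 1) >> 1, B_8 = C_3 = PAVG(A_3, A_5), hypothetical function names); because it samples randomly, the fractions it reports are estimates and are not expected to match the quoted percentages exactly.

```python
import random


def pavg(a, b):
    """Rounded-up average of two bytes (assumed PAVG semantics)."""
    return (a + b + 1) >> 1


def filter4_direct(a):
    """Reference: (A1 + 2A2 + 3A3 + 4A4 + 3A5 + 2A6 + A7 + 8) >> 4."""
    a1, a2, a3, a4, a5, a6, a7 = a
    return (a1 + 2 * a2 + 3 * a3 + 4 * a4 + 3 * a5 + 2 * a6 + a7 + 8) >> 4


def filter4_approx(a, second=False):
    """Approximation (96), or (97) when second is True."""
    a1, a2, a3, a4, a5, a6, a7 = a
    b1, b8 = pavg(a1, a7), pavg(a3, a5)
    c1, c3, c4 = pavg(b1, a2), b8, pavg(a6, b8)
    d1, d2 = pavg(c1, a4), pavg(c3, c4)
    eb8 = a3 ^ a5
    ed1, ed2 = c1 ^ a4, c3 ^ c4
    if second:
        ec1, ec4 = b1 ^ a2, a6 ^ b8
        last = (ec1 & ec4) | ((ec1 | ec4) & eb8)  # U_9 of (97)
    else:
        last = eb8  # (96) uses EB_8 directly
    corr = ((d1 ^ d2) | (ed1 & ed2) | ((ed1 | ed2) & last)) & 1
    return pavg(d1, d2) - corr


def error_stats(second=False, trials=20000, seed=0):
    """Return (largest absolute error, fraction of sampled inputs in error)."""
    rng = random.Random(seed)
    worst, wrong = 0, 0
    for _ in range(trials):
        a = [rng.randrange(256) for _ in range(7)]
        err = abs(filter4_approx(a, second) - filter4_direct(a))
        worst = max(worst, err)
        wrong += err != 0
    return worst, wrong / trials
```

Both variants subtract at most one from the average tree, which is why the error can never exceed ±1 regardless of the inputs.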

F. Type 4. Summary of Results 

[122] Table VI below summarizes the instructions required to compute each filter 
(given sufficient memory) by the efficient and conventional SIMD methods. For the 
Efficient case, we give the instructions required for the exact and approximate solutions. 



TABLE VI. Summary of Results for Type 4 FIR Filters. 

Type 4 Filter | Conventional Method | Efficient Method (Exact) | Efficient Method (Approx.) | Speedup (Exact) | Speedup (Approx.) 
[filter expression illegible] | 67 | N/A | 16 | N/A | 4.2 
(A_1 + 4A_2 + 6A_3 + 4A_4 + A_5 + 8*ONE) >> 4 | 29 | [illegible] | 8 | 1.5 | 3.6 
(A_1 + A_2 + 2A_3 + 2A_4 + [illegible] + A_8 + A_9 + 8*ONE) >> 4 | 49 | 36 | [illegible] | 1.4 | [illegible] 
(A_1 + 2A_2 + 2A_3 + 2A_4 + 2A_5 + 2A_6 + 2A_7 + 2A_8 + A_9 + 8*ONE) >> 4 | 53 | [illegible] | 19 | [illegible] | 2.8 
(A_1 + 2A_2 + 3A_3 + 4A_4 + 3A_5 + 2A_6 + A_7 + 8*ONE) >> 4 | 41 | 36 | [illegible] | 1.1 | [illegible] 



[123] The shaded areas show significant improvements in efficiency due to the analyses 
developed here. 

[124] Although the invention has been discussed with respect to specific embodiments 
thereof, these embodiments are merely illustrative, and not restrictive, of the invention. 
For example, although a two-operand SIMD instruction has been primarily discussed, 
techniques and features of the invention may be applicable to other applications where 
the number of operands, or arguments, of a SIMD instruction is not the same as the 
number of variables, or values, in a formula, computation or function to be implemented 
with the SIMD instruction (i.e., a "mismatched" instruction). 
[125] Although the invention has been described with respect to specific SIMD 
instructions to obtain an average of values, any other type of SIMD instruction or 
operation may benefit from the approach of the invention. Although specific operations 
such as addition, subtraction, bitwise AND, bitwise OR, bitwise logical right shift, 
bitwise logical left shift, bitwise exclusive OR, etc., are used in specific embodiments to 
achieve a result, other embodiments may use different operations, or combinations of 
operations, to achieve results. For example, an AND function can be realized by using an 
OR function and complementing, or inverting, the operands and result. Other such 
operational equivalents will be apparent. 
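
The AND-from-OR equivalence mentioned above is De Morgan's law. A minimal Python sketch on byte values follows; the function name and the 8-bit mask are illustrative choices, not part of the specification.

```python
MASK = 0xFF  # restrict Python's unbounded integers to one byte lane


def and_via_or(a, b):
    """Compute a & b using only OR and complement: ~(~a | ~b)."""
    not_a = ~a & MASK
    not_b = ~b & MASK
    return ~(not_a | not_b) & MASK
```

The same identity holds lane by lane for packed registers, since complement and OR act on each bit independently.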

[126] Alternative methods of detecting when the sum of two packed values results in an 
odd number can be employed. Some processors may provide instructions that combine 
multiple operations into one or more compound instructions. Although specific reference 
has been made to a "SIMD" type of instruction, other types of parallel instructions may 
be within the scope of the invention. Although the SIMD instruction has been described 
as a single instruction, other embodiments may use SIMD instructions that occupy more 
than a single instruction's worth of clock cycles, instruction cycles, or the like. 
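
As an illustration of alternative odd-sum detectors, the Python sketch below gives three equivalent tests for a single pair of byte values. The function names are hypothetical, and the third variant assumes that both a rounded-up and a rounded-down average are available.

```python
def odd_by_xor(a, b):
    """Least significant bit of the XOR: set when exactly one input is odd."""
    return (a ^ b) & 1


def odd_by_add(a, b):
    """Least significant bit of the full sum (needs a wider intermediate)."""
    return (a + b) & 1


def odd_by_averages(a, b):
    """Rounded-up minus rounded-down average differs by 1 for an odd sum."""
    return ((a + b + 1) >> 1) - ((a + b) >> 1)
```

The XOR form is the one that fits packed byte arithmetic most naturally, because it never needs more than 8 bits per lane.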
[127] There are various ways that the invention can be modified from specific 
embodiments described herein to achieve similar results. For example, adjustments to an 
approximate solution are performed as an intermediate step before computing the final 
result so that the approximate solution is no less than the actual solution. One 
modification can be to adjust the approximate solution so that it becomes no larger than 
the actual solution as an intermediate step. Such modifications will be apparent to one of 
skill in the art and are within the scope of the invention. 
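
The two directions of adjustment can be illustrated on a deliberately small case that is not taken from the specification: a four-value average (a + b + c + d + 2) >> 2 built from pairwise averages. In the Python sketch below, with hypothetical function names, rounding every pairwise average up gives an intermediate result that is never smaller than the exact value, and rounding every one down gives a result that is never larger.

```python
def avg_up(a, b):
    """Pairwise average rounded up."""
    return (a + b + 1) >> 1


def avg_down(a, b):
    """Pairwise average rounded down."""
    return (a + b) >> 1


def exact4(a, b, c, d):
    """Exact rounded four-value average."""
    return (a + b + c + d + 2) >> 2


def over4(a, b, c, d):
    """Tree of rounded-up averages: an upper bound on exact4."""
    return avg_up(avg_up(a, b), avg_up(c, d))


def under4(a, b, c, d):
    """Tree of rounded-down averages: a lower bound on exact4."""
    return avg_down(avg_down(a, b), avg_down(c, d))
```

Either bound can serve as the starting point; the correction step then only ever has to move the result in one direction.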

[128] Any suitable programming language can be used to implement the routines of the 
present invention including C, C++, Java, assembly language, etc. Different 
programming techniques can be employed such as procedural or object oriented. The 
routines can execute on a single processing device or multiple processors. Although the 
steps, operations or computations may be presented in a specific order, this order may be 
changed in different embodiments. In some embodiments, multiple steps shown as 
sequential in this specification can be performed at the same time. The sequence of 
operations described herein can be interrupted, suspended, or otherwise controlled by 
another process, such as an operating system, kernel, etc. The routines can operate in an 
operating system environment or as stand-alone routines occupying all, or a substantial 
part, of the system processing. 

[129] Steps can be performed in hardware or software, as desired. Note that steps can 
be added to, taken from or modified from the steps presented in this specification without 
deviating from the scope of the invention. In general, the flowcharts are only used to 
indicate one possible sequence of basic operations to achieve a functional aspect of the 
present invention. 

[130] In the description herein, numerous specific details are provided, such as 
examples of components and/or methods, to provide a thorough understanding of 
embodiments of the present invention. One skilled in the relevant art will recognize, 
however, that an embodiment of the invention can be practiced without one or more of 
the specific details, or with other apparatus, systems, assemblies, methods, components, 
materials, parts, and/or the like. In other instances, well-known structures, materials, or 
operations are not specifically shown or described in detail to avoid obscuring aspects of 
embodiments of the present invention. 

[131] A "computer-readable medium" for purposes of embodiments of the present 
invention may be any medium that can contain, store, communicate, propagate, or 
transport the program for use by or in connection with the instruction execution system, 
apparatus, system or device. The computer readable medium can be, by way of example 
only but not by limitation, an electronic, magnetic, optical, electromagnetic, infrared, or 
semiconductor system, apparatus, system, device, propagation medium, or computer 
memory. 

[132] A "processor" includes any system, mechanism or component that processes data, 
signals or other information. A processor can include a system with a general-purpose 
central processing unit, multiple processing units, dedicated circuitry for achieving 
functionality, or other systems. Processing need not be limited to a geographic location, 
or have temporal limitations. For example, a processor can perform its functions in "real 
time," "offline," in a "batch mode," etc. Portions of processing can be performed at 
different times and at different locations, by different (or the same) processing systems. 
[133] Reference throughout this specification to "one embodiment", "an embodiment", 
or "a specific embodiment" means that a particular feature, structure, or characteristic 
described in connection with the embodiment is included in at least one embodiment of 
the present invention and not necessarily in all embodiments. Thus, respective 
appearances of the phrases "in one embodiment", "in an embodiment", or "in a specific 
embodiment" in various places throughout this specification are not necessarily referring 
to the same embodiment. Furthermore, the particular features, structures, or 
characteristics of any specific embodiment of the present invention may be combined in 
any suitable manner with one or more other embodiments. It is to be understood that 
other variations and modifications of the embodiments of the present invention described 
and illustrated herein are possible in light of the teachings herein and are to be considered 
as part of the spirit and scope of the present invention. 

[134] Embodiments of the invention may be implemented by using a programmed 
general purpose digital computer, or by using application specific integrated circuits, 
programmable logic devices, or field programmable gate arrays; optical, chemical, 
biological, quantum or nanoengineered systems, components and mechanisms may also be 
used. In general, the functions of the present invention can be achieved by any means as 
is known in the art. Distributed, or networked systems, components and circuits can be 
used. Communication, or transfer, of data may be wired, wireless, or by any other 
means. 

[135] It will also be appreciated that one or more of the elements depicted in the 
drawings/figures can also be implemented in a more separated or integrated manner, or 
even removed or rendered as inoperable in certain cases, as is useful in accordance with a 
particular application. It is also within the spirit and scope of the present invention to 
implement a program or code that can be stored in a machine-readable medium to permit 
a computer to perform any of the methods described above. 

[136] Additionally, any signal arrows in the drawings/Figures should be considered 
only as exemplary, and not limiting, unless otherwise specifically noted. Furthermore, 
the term "or" as used herein is generally intended to mean "and/or" unless otherwise 
indicated. Combinations of components or steps will also be considered as being noted, 
where terminology is foreseen as rendering the ability to separate or combine unclear. 
[137] As used in the description herein and throughout the claims that follow, "a", "an", 
and "the" include plural references unless the context clearly dictates otherwise. Also, 
as used in the description herein and throughout the claims that follow, the meaning of 
"in" includes "in" and "on" unless the context clearly dictates otherwise. 
[138] The foregoing description of illustrated embodiments of the present invention, 
including what is described in the Abstract, is not intended to be exhaustive or to limit the 
invention to the precise forms disclosed herein. While specific embodiments of, and 
examples for, the invention are described herein for illustrative purposes only, various 
equivalent modifications are possible within the spirit and scope of the present invention, 
as those skilled in the relevant art will recognize and appreciate. As indicated, these 
modifications may be made to the present invention in light of the foregoing description 
of illustrated embodiments of the present invention and are to be included within the 
spirit and scope of the present invention. 

[139] Thus, while the present invention has been described herein with reference to 
particular embodiments thereof, a latitude of modification, various changes and 
substitutions are intended in the foregoing disclosures, and it will be appreciated that in 
some instances some features of embodiments of the invention will be employed without 
a corresponding use of other features without departing from the scope and spirit of the 
invention as set forth. Therefore, many modifications may be made to adapt a particular 
situation or material to the essential scope and spirit of the present invention. It is 
intended that the invention not be limited to the particular terms used in the following claims 
and/or to the particular embodiment disclosed as the best mode contemplated for carrying 
out this invention, but that the invention will include any and all embodiments and 
equivalents falling within the scope of the appended claims. 